- Author: Your Name
- First Commit: yyyy-mm-dd                      # following ISO 8601 format
- Last Commit: yyyy-mm-dd                       # following ISO 8601 format
- Description: This notebook is used to perform EDA on the "xxxxx" dataset

## Key Steps in this Data Analysis:

1. **Framing the Question:** 
   - The first step in any data analysis is to ask the right question(s) of the given data. 
   - Identifying the objective of the analysis makes it easier to decide on the type(s) of data needed to draw conclusions.

2. **Data Wrangling:** 
   - Data wrangling, sometimes referred to as data munging or data pre-processing, is the process of gathering, assessing, and cleaning "raw" data into a form suitable for analysis.

3. **Exploratory Data Analysis (EDA):** 
   - Once the data is collected, cleaned, and processed, it is ready for analysis. 
   - During this phase, you can use data analysis tools and software to understand, interpret, and derive conclusions based on the requirements.

4. **Drawing Conclusions:** 
   - After completing the analysis phase, the next step is to interpret the analysis and draw conclusions. 
   - Three key questions to ask at this stage:
     - Did the analysis answer my original question?
     - Were there any limitations in my analysis that could affect my conclusions?
     - Was the analysis sufficient to support decision-making?

5. **Communicating Results:** 
   - Once data has been explored and conclusions have been drawn, it's time to communicate the findings to the relevant audience. 
   - Effective communication can be achieved through data storytelling, writing blogs, making presentations, or filing reports.

**Note:** The five steps of data analysis are not always followed linearly. The process can be iterative, with steps revisited based on new insights or requirements that arise during the analysis.


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import os
import warnings

## 1. Data Wrangling

### 1.1 Gathering data

In [None]:
# Import the CSV file
df = pd.read_csv('../datasets/df.csv')

### 1.2 Assessing the data

In [None]:
# Take a look at the data's shape
display(df.shape)

In [None]:
# Take a look at the data's info (df.info() prints its summary and returns None)
df.info()

In [None]:
# Take a look at the first rows of the data
display(df.head())

In [None]:
# Search for NULL values
display(df.isnull().sum())

In [None]:
# Check the type of information that every column has
display(df.dtypes)

### 1.3 Data Cleaning

### 1.3.1 Remove irrelevant data

In [None]:
# Remove irrelevant columns (e.g., in this case we drop the columns that have more null values than valid values)
df = df.drop(columns=['column1', 'column2'])

### 1.3.2 Remove/replace null values


In [None]:
# Check the datatype of the column with missing values
display(df['column1'].dtype)

# Convert the column to datetime type
df['column1'] = pd.to_datetime(df['column1'])

In [None]:
# Calculate the mean of this column
mean_df = df['column1'].mean()
display(mean_df)

In [None]:
# Impute the null values in this column with the mean (or the best method for each case)
df['column1'] = df['column1'].fillna(mean_df)


In [None]:
# Check if null values still exist in the dataset
df.isnull().sum()

#### 1.3.2.2 Dataset's columns types

In [None]:
# Check the datatype of the column with missing values (in this case it needs to be datetime type)
display(df['column1'].dtype)

In [None]:
# Calculate the mean of this column
mean_column1 = df['column1'].mean()

In [None]:
# Impute the null values in this column with the mean
df['column1'] = df['column1'].fillna(mean_column1)

In [None]:
# Check again if null values exist in the dataset
df.isnull().sum()

### 1.3.3 Drop the duplicates, if any

Use the `duplicated()` function to find duplicated rows in the dataset.

In [None]:
# Count duplicated rows based on all columns
display(df.duplicated().sum())

In [None]:
# drop duplicates in df 
df = df.drop_duplicates()

### 1.3.4 Type conversion

Make sure numbers are stored as numerical data types. A date should be stored as a date object or as a Unix timestamp (the number of seconds since 1970-01-01 UTC).
In this case we already did this.
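
As a minimal sketch of what such a conversion can look like, the example below uses `pd.to_numeric` and `pd.to_datetime` on a small made-up DataFrame. The column names `price` and `order_date` and the sample values are assumptions for illustration only; they do not come from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical sample data: every column arrives as text
raw = pd.DataFrame({
    "price": ["10.5", "20", "not available"],
    "order_date": ["2023-01-15", "2023-02-01", "2023-03-10"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising an error
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# Parse ISO 8601 date strings into datetime64 values
raw["order_date"] = pd.to_datetime(raw["order_date"], format="%Y-%m-%d")

print(raw.dtypes)
```

Coercing invalid entries to NaN keeps the conversion from failing, but it also creates new missing values, so it is worth re-running the null check afterwards.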

### 1.3.5 Syntax Errors

### 1.3.6 Outliers

Outliers are values that differ significantly from the other observations. A common rule of thumb treats any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as an outlier, where IQR = Q3 - Q1 is the interquartile range.
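
The 1.5 * IQR rule can be sketched as follows. The helper name `iqr_outlier_mask`, its parameter `k`, and the sample values are hypothetical and chosen only for illustration; this is one possible way to apply the rule, not the only one.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical sample values with one obviously extreme observation
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(values)

# Show only the values flagged as outliers
print(values[mask])
```

Raising `k` (for example to 3) makes the rule more conservative, which may be preferable when extreme but legitimate values are expected.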

In general, an e-commerce dataset obtained from a well-functioning system is less likely to have outliers compared to datasets that involve manual data entry or measurement errors. E-commerce datasets typically capture transactional information, such as customer details, product information, and order-related data, which are less prone to outliers.

However, it's still possible to have outliers in certain scenarios, such as:

- Data entry errors: Although automated systems minimize data entry errors, there can still be instances where incorrect or extreme values are recorded.
- Measurement errors: If the dataset includes measurements or quantitative data collected manually, there may be measurement errors leading to outliers.
- System glitches or anomalies: While rare, system glitches or anomalies can occasionally result in outliers in the data.
- Fraudulent activities: In some cases, fraudulent transactions or activities may introduce outliers into the dataset.

Therefore, while it's reasonable to assume that the occurrence of outliers in an e-commerce dataset is relatively low, it's still advisable to examine the data and apply outlier detection techniques to ensure data quality and integrity.

Remember that outlier detection is an iterative process, and there is no one-size-fits-all approach. It requires a combination of domain knowledge, data understanding, and experimentation to determine the most suitable method and threshold for your specific dataset and analysis objectives.

In [None]:
# Find outliers in the numerical columns of each DataFrame
dataframes = [df, df1, df2, df3,
              df4]

# Define percentiles for outlier detection (e.g., values outside [5th percentile, 95th percentile])
lower_percentile = 5
upper_percentile = 95

# Use a separate loop variable so that df is not overwritten
for frame in dataframes:
    # Identify numerical columns
    numerical_columns = frame.select_dtypes(include=np.number).columns

    for column in numerical_columns:
        # Calculate percentile thresholds for the column (quantile() skips NaN values)
        lower_threshold = frame[column].quantile(lower_percentile / 100)
        upper_threshold = frame[column].quantile(upper_percentile / 100)

        # Find rows with outliers in the column
        outlier_rows = (frame[column] < lower_threshold) | (frame[column] > upper_threshold)

        # Print rows with outliers in the column
        print(f"Outliers in column '{column}':")
        print(frame[outlier_rows])
        print('\n')

After performing the outlier analysis, we identified that, given the nature of the data, there are no relevant outliers, although numerically some do exist.

### 1.3.7 In-record & cross-datasets errors

These errors occur when two or more values in the same row, or across datasets, contradict each other. For example, in a dataset about the cost of living in cities, the total column must equal the sum of rent, transport, and food.
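
A check for the cost-of-living example might look like the sketch below. The DataFrame `costs`, its column names, and its values are made up for illustration; the second row is deliberately inconsistent.

```python
import numpy as np
import pandas as pd

# Hypothetical cost-of-living data; the total for city "B" does not match its parts
costs = pd.DataFrame({
    "city": ["A", "B", "C"],
    "rent": [1000.0, 800.0, 1200.0],
    "transport": [100.0, 80.0, 150.0],
    "food": [300.0, 250.0, 400.0],
    "total": [1400.0, 1200.0, 1750.0],
})

# Recompute the total from its components
expected_total = costs[["rent", "transport", "food"]].sum(axis=1)

# np.isclose tolerates small floating-point rounding differences
inconsistent = costs[~np.isclose(costs["total"], expected_total)]

print(inconsistent)
```

Rows returned by this check need a decision based on domain knowledge: either the total or one of its components is wrong, and the data alone cannot tell which.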

## Export the cleaned data

In [None]:
# Export the cleaned data to perform EDA in the future
# Specify the path to the dataset folder
folder_path = "../datasets/"

dataframes = [df, df1, df2, df3,
              df4]

# One file name per DataFrame, in the same order (zip() silently stops at the shorter list)
file_names = ['df.csv', 'df1.csv', 'df2.csv', 'df3.csv',
              'df4.csv']

for frame, file_name in zip(dataframes, file_names):
    # Add the prefix "cleaned_" to the file name
    cleaned_file_name = "cleaned_" + file_name
    # Get the full path of the output file
    output_file_path = os.path.join(folder_path, cleaned_file_name)
    # Save the cleaned DataFrame to CSV
    frame.to_csv(output_file_path, index=False)