# <font color=#023F7C> **Data cleaning and exploration** </font>

<font color=#023F7C>**Hi! PARIS DataBootcamp 2023 🚀**</font> <br>


<img src = https://www.hi-paris.fr/wp-content/uploads/2020/09/logo-hi-paris-retina.png width = "300" height = "200" >

**What is Data Cleaning ?**<br>

Data cleaning is a crucial step in the data analysis and machine learning process, as the quality of the insights and models generated heavily relies on the accuracy and reliability of the underlying data. Raw data often contains **errors**, **inconsistencies**, **missing values**, and **outliers** that can distort results or lead to faulty conclusions. Data cleaning involves identifying and rectifying these issues, ensuring the dataset is trustworthy and suitable for analysis. 

Python provides a robust ecosystem of libraries and tools for data cleaning tasks: <br>
- The `pandas` library offers functions to handle missing data through imputation or removal, detect and remove duplicates, and transform data types. 
- The `NumPy` library provides statistical functions (for example percentiles, mean and standard deviation) that can be used to detect and filter outliers. 
- Visualization libraries like `Matplotlib` and `Seaborn` help identify anomalies visually. 

Used together, these tools contribute significantly to producing accurate analyses and reliable machine learning models.


**Before you start working on this notebook ⚠️**: <br>
Please download/copy this notebook from `hfactory_magic_folders\course` and drop it into your own directory `my_work` on HFactory. <br>
If you don't, you won't be able to save the modifications you make to this notebook.

**Business case** 💼: <br>
You've been provided a supply chain dataset by an organization that is trying to improve their supply chain operations.<br>
- The goal of the bootcamp is to use Machine Learning to be able to predict either `Late_delivery_risk` or `Delivery_Status` in the dataset. <br>
- Before building a Machine Learning model, an essential step is to clean and analyze the data with data visualization.



**Need help ? 🙏** <br>
You can go to the Introduction and Intermediate python notebooks to learn how to use the `pandas` library. <br>

## **1. Import libraries and dataset**
First, let's import Python libraries.

In [64]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None) #Show all columns

Then, let's import the dataset `dataset_train.csv` using pandas.


In [65]:
path=r'~/hfactory_magic_folders/course/Dataset/dataset_train.csv'

# Import the csv file
dataset = pd.read_csv(path,encoding='latin-1',sep=';')

## **2. Data discovery**

**Question 1**: <br> 
**Display the dataset's head and tail.**

**Question 2**: <br> **Use the pandas function `.info()` to get general information on the dataset.**<br>

**What can you say about the loaded dataset ?**


**Question 3**:  <br> **Print all the columns/variables of the dataset.**

## **3. Analyze the dataframe's dtypes**
**Question 4**: <br> 
**Create 3 lists of column names, one for each dtype: int, float and object.**
- List 1: Columns with an `int64` type
- List 2: Columns with a `float64` type
- List 3: Columns with an `object` type.

*Note: You can use pandas' `.select_dtypes()` function to get columns with a specific dtype.* <br>
*Create a list from a pandas Series or Index (such as `.columns`) with `.to_list()`.*
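
The note above can be sketched on a small made-up DataFrame. The column names `order_id`, `price` and `city` are hypothetical and are not taken from the bootcamp dataset; the dtypes are set explicitly so that the sketch should behave the same way across pandas versions.

```python
import pandas as pd

# Toy DataFrame with hypothetical columns (NOT the bootcamp dataset)
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "price": [9.99, 15.50, 7.25],
    "city": ["Paris", "Lyon", "Nice"],
}).astype({"order_id": "int64", "price": "float64", "city": "object"})

# .select_dtypes() keeps only the columns with the requested dtype;
# .columns.to_list() turns the remaining column index into a Python list
int_cols = toy.select_dtypes(include="int64").columns.to_list()
float_cols = toy.select_dtypes(include="float64").columns.to_list()
object_cols = toy.select_dtypes(include="object").columns.to_list()

print(int_cols, float_cols, object_cols)
```

The same three calls should work on `dataset` once it is loaded.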

**Question 5**: <br> 
**Compute the number of unique values for the columns with an object and int type.** <br>

*Note: Combine the list with int columns and object columns using the `+` operator*. <br>
*Create a dataframe with the number of unique values and the corresponding variable.*
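
One possible way to follow the note above, shown on a toy DataFrame. The columns `status`, `quantity` and `price`, the output column names `variable` and `n_unique`, and the threshold of 2 are all assumptions made for illustration.

```python
import pandas as pd

# Toy DataFrame with hypothetical columns (NOT the bootcamp dataset)
toy = pd.DataFrame({
    "status": ["late", "on time", "late", "late"],
    "quantity": [1, 2, 2, 5],
    "price": [9.99, 15.50, 7.25, 3.00],
})
int_cols = ["quantity"]
object_cols = ["status"]

# The + operator concatenates two Python lists
cols = int_cols + object_cols

# .nunique() returns a Series: index = column name, value = number of distinct values.
# rename_axis/reset_index turn it into a two-column DataFrame.
unique_counts = (
    toy[cols]
    .nunique()
    .rename_axis("variable")
    .reset_index(name="n_unique")
)

# Keep only the variables above a chosen threshold (2 here, for the toy data)
many_values = unique_counts[unique_counts["n_unique"] > 2]
print(unique_counts)
```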

**Which columns/variables have over 15 unique values ?**

**Question 6**: <br>
**Compute the summary statistics of columns with a float type, with pandas' `.describe()` function.** <br>

**Do you detect outliers (weird/abnormal values) in the data ?**
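
As a hint for reading the `.describe()` output, here is a minimal sketch on made-up prices. The 1.5 × IQR rule used below is one common heuristic for flagging outliers; it is an assumption of this example, not something the notebook prescribes.

```python
import pandas as pd

# Toy data with one obviously abnormal value (hypothetical, NOT the bootcamp dataset)
prices = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})

# Summary statistics: a max far above the 75% quartile hints at an outlier
summary = prices.describe()
print(summary)

# IQR heuristic: flag values more than 1.5 * IQR outside the quartiles
q1 = prices["price"].quantile(0.25)
q3 = prices["price"].quantile(0.75)
iqr = q3 - q1
is_outlier = (prices["price"] < q1 - 1.5 * iqr) | (prices["price"] > q3 + 1.5 * iqr)
outliers = prices[is_outlier]
print(outliers)
```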

## **4. Analyze missing values**

**Question 7**: <br> **Compute the number of NaN values for every variable/column.** <br> 

*Note: A NaN value represents a missing value in a cell of the dataframe* <br>
*You can use the `.isna()` function.*
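
A minimal sketch of counting missing values per column, using a toy DataFrame with hypothetical columns:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with hypothetical columns (NOT the bootcamp dataset)
toy = pd.DataFrame({
    "price": [9.99, np.nan, 7.25, np.nan],
    "city": ["Paris", "Lyon", None, "Nice"],
    "quantity": [1, 2, 3, 4],
})

# .isna() gives a True/False table; .sum() counts the True values per column
nan_counts = toy.isna().sum()

# Names of the columns that contain at least one missing value
cols_with_nan = nan_counts[nan_counts > 0].index.to_list()
print(nan_counts)
print(cols_with_nan)
```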

**Which variables of the dataset have missing values ?**

**Question 8:** <br> 
**Drop the columns of the dataset that have missing values with `.dropna(axis=1)`.** <br>
*Note: `.dropna(axis=1)` removes columns, so the row index is unchanged. If you drop rows instead (`.dropna(axis=0)`), don't forget to add `.reset_index(drop=True)` afterwards!*
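
To see what `.dropna()` does in each direction, here is a short sketch on a toy DataFrame (columns `a` and `b` are hypothetical). It suggests why `.reset_index(drop=True)` matters mainly when rows are removed.

```python
import numpy as np
import pandas as pd

# Toy DataFrame: column "a" has one missing value (hypothetical data)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4, 5, 6]})

# axis=1 drops every COLUMN that contains a NaN; all rows are kept
without_nan_cols = toy.dropna(axis=1)

# axis=0 drops every ROW that contains a NaN; the index would be [0, 2],
# so reset_index(drop=True) renumbers it to [0, 1]
without_nan_rows = toy.dropna(axis=0).reset_index(drop=True)

print(without_nan_cols)
print(without_nan_rows)
```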

**If you don't want to drop columns or rows, you can replace the missing values in each variable instead.** <br>
Try the following methods only if the variable has a small number of NaN values (less than 10%).
- Replace with the mean or median value for continuous variables (mostly columns with a float dtype)
- Replace with the variable's most frequent value (`.mode()`) or by creating a new category for categorical variables (mostly columns with an int/object dtype)

You can drop the variables with a high number of missing values.

At this step, the dataset shouldn't have any missing values (you can check with `.isna().sum().sum()`)
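
The replacement strategies listed above could look like this on a toy DataFrame. The columns `price`, `city` and `segment`, and the `"Unknown"` category label, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with hypothetical columns (NOT the bootcamp dataset)
toy = pd.DataFrame({
    "price": [10.0, np.nan, 14.0, 12.0],
    "city": ["Paris", "Paris", None, "Lyon"],
    "segment": ["A", None, "B", "B"],
})

# Continuous variable: replace NaN with the median (use .mean() for the mean)
toy["price"] = toy["price"].fillna(toy["price"].median())

# Categorical variable: replace NaN with the most frequent value.
# .mode() returns a Series (there can be ties), so take its first element.
toy["city"] = toy["city"].fillna(toy["city"].mode().iloc[0])

# Categorical variable: alternatively, create a new category for missing values
toy["segment"] = toy["segment"].fillna("Unknown")

# Final check: total number of NaN values left in the DataFrame
total_nan = toy.isna().sum().sum()
print(total_nan)
```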


**Question 9**: <br>
We want to share this dataset with a customer. Can we do this without modifying it? <br>
If not, drop the elements/variables that must be removed.

**Question 10**: <br>
**Save the cleaned dataframe as a csv file called `dataset_train_clean.csv` using pandas' `.to_csv()` function.** <br>
*Note: Make sure to add `index=False` to the `.to_csv()` function or else the index of the dataframe will be saved too.*
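
To illustrate the effect of `index=False`, the sketch below calls `.to_csv()` without a path, which returns the CSV content as a string instead of writing a file. The toy DataFrame and its columns are hypothetical.

```python
import pandas as pd

# Toy DataFrame with hypothetical columns (NOT the bootcamp dataset)
toy = pd.DataFrame({"order_id": [1, 2], "city": ["Paris", "Lyon"]})

# Without a path, .to_csv() returns the CSV text, which is handy for inspection
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Compare the header lines: the first one starts with an empty index column
header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]
print(header_with_index)
print(header_without_index)
```

Passing a file name as the first argument, for example `toy.to_csv("some_file.csv", index=False)`, writes the same content to disk.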