<img src="../assets/images/Cover.png" alt="Cover" title="AI2E Cover" />

## AI2E - [Workshop 2] - [Data exploration] 

We will go through the essential steps to explore and get the most benefit from our data. 

### Content 
1. Get started with data
2. Feature selection
3. Feature engineering
4. Handle missing values
5. Data visualisation
6. Handle outliers
7. Encode data
8. Scaling
9. Conclusion

### 1. Get started with data:
The dataset, provided by an Algerian company, includes variables such as the departure address, the arrival address, the distance, ...
The training dataset provided here is a subset of over 60,000 samples.

#### Variable descriptions
<img src="../assets/images/w2_Vdesc.PNG" title="variables description" />


In [None]:
# imports
import pandas as pd
from datetime import date
from matplotlib import pyplot as plt

In [None]:
# read the data
df = pd.read_csv('data/vtc_data.csv')
df.head()

In [None]:
# get summary statistics for the numeric columns
df.describe()

<b>std: </b>the standard deviation measures how much the values in a set vary around their mean.
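As a small worked example, the value in the `std` row can be reproduced by hand. The numbers below are made up for illustration and are not taken from the workshop dataset. Note that pandas uses the sample formula by default, dividing by n - 1 rather than n.

```python
import math
import pandas as pd

# Hypothetical distances, for illustration only
distances = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = distances.mean()
# Sample standard deviation: sum of squared deviations divided by n - 1
manual_std = math.sqrt(((distances - mean) ** 2).sum() / (len(distances) - 1))

pandas_std = distances.std()                # default ddof=1 (sample formula)
describe_std = distances.describe()["std"]  # the value shown by describe()
print(manual_std, pandas_std, describe_std)
```

All three printed values should agree up to floating-point rounding.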


In [None]:
# get an overview of the data (column types and non-null counts)
df.info()

In [None]:
# check the time span covered by the data
print("First date : ", df["date_of_travel"].min())
print("Last date : ", df["date_of_travel"].max())

### 2. Feature selection
Feature selection is one of the core concepts in machine learning: the features you use to train a model have a large influence on the performance it can achieve.  
<b> Irrelevant or partially relevant features can negatively impact model performance. </b>
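One simple heuristic for a first pass, sketched here on made-up data, is to rank numeric candidate features by the absolute value of their correlation with the target. The column names below are illustrative assumptions, not necessarily those of the workshop dataset, and correlation only captures linear relationships, so treat the ranking as a hint rather than a decision.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only
toy = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hour": [8, 17, 8, 17, 8],
    "estimated_time": [5.0, 9.0, 16.0, 19.0, 26.0],
})

# Absolute Pearson correlation of each candidate feature with the target
ranking = (
    toy.corr()["estimated_time"]
    .drop("estimated_time")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

In this toy data, `distance` ranks well above `hour`, which suggests keeping `distance` and looking more closely at whether `hour` adds anything.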

In [None]:
# Exercise: define our features and target

In [None]:
chosen_features = []
target_name = ""

In [None]:
# copy the selection so that later column assignments do not raise SettingWithCopyWarning
features = df[chosen_features].copy()
labels = df[target_name]

### 3. Feature engineering
Feature engineering is the process of using domain knowledge of the data to create features that help machine learning algorithms perform better.

In [None]:
# add hour and day columns

# parse the timestamp strings into datetime values
features["date_of_travel"] = pd.to_datetime(features["date_of_travel"], format="%Y-%m-%d %H:%M:%S")

# create the new columns
features["hour"] = features["date_of_travel"].dt.hour
features["day_name"] = features["date_of_travel"].dt.day_name()
# drop the "date_of_travel" column
features.drop(["date_of_travel"], axis = 1, inplace = True)
features.head()

In [None]:
# split the arrival coordinates into separate numeric lat and lon columns
features[['lat','lon']] = features.lat_and_long_of_arrival_address.str.split(",", expand=True)
features["lat"] = pd.to_numeric(features["lat"], downcast="float")
features["lon"] = pd.to_numeric(features["lon"], downcast="float")
features.drop(["lat_and_long_of_arrival_address"], axis = 1, inplace = True)
features.head()

### 4. Handle missing values
Missing values are one of the most common problems you will encounter when preparing data for machine learning. They can come from human error, interruptions in the data flow, privacy concerns, and so on. Whatever the reason, missing values affect the performance of machine learning models, and many algorithms cannot handle them at all.
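Two common ways to deal with a missing value are to drop the incomplete row or to fill the gap with a summary statistic such as the mean. The sketch below shows both on a made-up column; which option is appropriate depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, for illustration only
toy = pd.DataFrame({"distance": [2.0, np.nan, 4.0, 6.0]})

# Option 1: discard rows that contain a missing value
dropped = toy.dropna()

# Option 2: replace the missing value with the column mean
# (mean() ignores NaN by default, so the mean here is 4.0)
filled = toy.fillna({"distance": toy["distance"].mean()})

print(len(dropped), filled["distance"].tolist())
```

Dropping loses a whole row of information, while filling keeps the row at the cost of inventing a plausible value.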

In [None]:
# check if there are null values in each column
for col in features.columns :
    print(col,':' ,features[col].isnull().sum())

In [None]:
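# fill null values with the column mean
# (assigning the result back avoids the chained inplace call, which recent pandas versions warn about)
features["distance"] = features["distance"].fillna(features["distance"].mean())
features["lat"] = features["lat"].fillna(features["lat"].mean())
features["lon"] = features["lon"].fillna(features["lon"].mean())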

In [None]:
# Exercise: confirm that we don't have null values

### 5. Data visualisation
Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images.

In [None]:
# heatmap to show correlation
import seaborn as sns
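# include the labels column so the target appears in the matrix
# numeric_only=True skips the text columns (requires pandas >= 1.5)
corr = pd.concat([features, labels], axis = 1).corr(numeric_only=True)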
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)

<b>Correlation: </b>the degree to which two variables are linearly related.
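For reference, `DataFrame.corr()` computes the Pearson coefficient by default. For two variables x and y with n paired observations it is:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The value ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to 1 (perfect positive linear relationship).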

In [None]:
# plot the estimated time as a function of the distance
plt.scatter(features['distance'], labels)
plt.xlabel("distance")
plt.ylabel("estimated time")
plt.show()

### 6. Handle outliers
Before looking at how to handle outliers, note that plotting the data is usually the most reliable way to detect them. Purely statistical rules can flag valid points or miss real anomalies, whereas a plot lets you judge each suspicious point in context and decide with more confidence.
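For contrast, here is a minimal sketch of one widely used statistical rule, the 1.5 × IQR rule, applied to made-up values. It is shown only to illustrate what a statistical method looks like. The threshold of 1.5 is a convention, not something derived from the workshop data, and on skewed data the rule can discard perfectly valid points.

```python
import pandas as pd

# Hypothetical trip durations; 500 is an obvious anomaly
durations = pd.Series([10, 12, 11, 13, 12, 500])

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Keep only values inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
inside = durations.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
kept = durations[inside]
print(kept.tolist())
```

The same boolean-mask pattern works with a threshold read off a plot instead of one computed from quartiles.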

In [None]:
# Exercise: remove all the points which have estimated time > 400

In [None]:
# Exercise: re-plot the chart to confirm

<b>Note: </b> at this stage, with the information we now have, we can repeat feature selection to keep only <b> the best </b> features.

### 7. Encode data
One-hot encoding is one of the most common encoding methods in machine learning. It replaces a categorical column with one flag column per category and assigns each flag 0 or 1. In each row, the flag for that row's category is 1 and all the others are 0.
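To make this concrete, here is what `pd.get_dummies` produces for a tiny made-up column. The category values are hypothetical and not taken from the workshop dataset.

```python
import pandas as pd

# Hypothetical categorical column, for illustration only
toy = pd.DataFrame({"car_type": ["sedan", "suv", "sedan"]})

# One flag column per category, named "<prefix>_<category>"
flags = pd.get_dummies(toy["car_type"], prefix="car_type")
print(flags)
```

The result has the columns `car_type_sedan` and `car_type_suv`, with exactly one flag set per row. Depending on the pandas version, the flags are displayed as `True`/`False` or as `1`/`0`.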

In [None]:
#Encode data
def oneHotEncode(df, col):
    dfDummies = pd.get_dummies(df[col], prefix = col)
    df = pd.concat([df, dfDummies], axis=1)
    
    # write your code here
    
    return df

In [None]:
# Exercise: complete the oneHotEncode function (drop the original column)

In [None]:
# applying the function to our data
features = oneHotEncode(features, 'travel_type')
features = oneHotEncode(features, 'car_type')
features = oneHotEncode(features, 'day_name')
features.head()

### 8. Scaling
In most cases, the numerical features of a dataset do not share a common range. In real life, it makes no sense to expect age and income columns to have the same range. But from the machine learning point of view, how can these two columns be compared?  
<b> Scaling </b> solves this problem. After scaling, the continuous features all cover the same range.
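As an illustration, min-max scaling maps each value x to (x - min) / (max - min), so every column ends up in the range 0 to 1. The sketch below applies it to made-up age and income values. The helper name `min_max` is hypothetical, and the code assumes the column is not constant, since a constant column would cause a division by zero.

```python
import pandas as pd

# Hypothetical values, for illustration only
ages = pd.Series([20, 30, 40, 60])
incomes = pd.Series([1000, 3000, 5000, 9000])

def min_max(s):
    # (x - min) / (max - min); assumes s.max() != s.min()
    return (s - s.min()) / (s.max() - s.min())

print(min_max(ages).tolist())
print(min_max(incomes).tolist())
```

Although the raw ranges differ by orders of magnitude, both scaled columns now run from 0 to 1 and can be compared directly.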

In [None]:
# scale data
def scale(df, cols):
    for col in cols:
        # write your code here:
        pass  # placeholder so the cell runs before the exercise is completed
    return df
features = scale(features, ["distance", "hour"]) 
features.head()

In [None]:
# Exercise: apply min-max scaling in the previous function

### 9. Conclusion

You are now able to explore data and extract the most useful information from it.

In the next lesson, you will learn how to use our final data to create and train a model.