S12 T01: Pipelines, grid search and text mining

# Statement
Let's get familiar with pipelines, grid search and text mining, starting with some basic exercises.

Level 1

- Exercise 1

Take the dataset you prefer, build a pipeline and run a grid search applying the Random Forest algorithm.

- Exercise 2

Take any English text you choose, and calculate the frequency of the words.

Level 2

- Exercise 1

Find the stopwords and perform stemming on your dataset.

Level 3

- Exercise 1

Perform sentiment analysis on your set of words.

Resources
Classroom resources and <https://www.nltk.org>

# Preprocessing

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer 
from sklearn.model_selection import GridSearchCV


In [2]:
# settings to display all columns (default is 20, now is None (all))
pd.set_option("display.max_columns", None)

In [3]:
# Import the cleaned and sampled train and test datasets from the previous Task.
# Forward slashes avoid invalid escape sequences such as "\d" and also work on Windows.
df_train = pd.read_csv('../data/DelayedFlights_train.csv')
df_test  = pd.read_csv('../data/DelayedFlights_test.csv')

### Explanation of the Train / Test Sample 
* It is imported from the previous Task (S09T01).  
* It is 1% of the original dataset, randomly sampled and stratified by airline.
* It is split into 33% test and 66% train.
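
The sampling described above was done in a previous task and is not shown here. As a rough illustration only, a stratified 1% sample followed by a stratified train/test split could be sketched as follows. The `flights` table is synthetic, and the choice of `train_test_split` is an assumption, not necessarily what S09T01 did.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flights table (hypothetical data, not the real dataset)
rng = np.random.default_rng(0)
flights = pd.DataFrame({
    "UniqueCarrier": rng.choice(["AA", "DL", "WN"], size=30_000, p=[0.5, 0.3, 0.2]),
    "ArrDelay": rng.normal(loc=40, scale=30, size=30_000),
})

# Draw a 1% sample that preserves the carrier proportions
sample, _ = train_test_split(
    flights, train_size=0.01, stratify=flights["UniqueCarrier"], random_state=42
)

# Split the sample into train and test (33% test), again stratified by carrier
train, test = train_test_split(
    sample, test_size=0.33, stratify=sample["UniqueCarrier"], random_state=42
)

print(len(sample), len(train), len(test))
print(sample["UniqueCarrier"].value_counts(normalize=True).round(2))
```

Stratifying both steps keeps each airline's share in the train and test sets close to its share in the full table.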

In [4]:
# Let's delete the first column
df_train = df_train.drop(columns='Unnamed: 0')
df_test  = df_test.drop(columns='Unnamed: 0')

### Deletion of DepDelay attribute 


In [5]:
# Let's delete the DepDelay column
df_train = df_train.drop(columns='DepDelay')
df_test  = df_test.drop(columns='DepDelay')

### Concatenation Train / Test sets.

In [6]:
#Let's concatenate df_train & df_test
df_complete = pd.concat([df_train,df_test])
print(df_complete.shape)
df_complete.head()

(19283, 33)


Unnamed: 0,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,Distance,TaxiIn,TaxiOut,Date,UniqueCarrier_9E,UniqueCarrier_AA,UniqueCarrier_AQ,UniqueCarrier_AS,UniqueCarrier_B6,UniqueCarrier_CO,UniqueCarrier_DL,UniqueCarrier_EV,UniqueCarrier_F9,UniqueCarrier_FL,UniqueCarrier_HA,UniqueCarrier_MQ,UniqueCarrier_NW,UniqueCarrier_OH,UniqueCarrier_OO,UniqueCarrier_UA,UniqueCarrier_US,UniqueCarrier_WN,UniqueCarrier_XE,UniqueCarrier_YV
0,2,1.396108,1.553827,1.265258,1.312513,-0.781188,-0.790736,-0.818653,-0.423085,-0.68307,-0.341142,0.130433,2008-07-29,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,2,-0.699602,-0.997079,-0.540408,-0.920752,-0.90502,-0.832512,-0.804233,0.577369,-0.783172,-0.154948,-0.654312,2008-10-07,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,4,0.049503,0.125887,0.575125,0.600955,0.402102,0.490397,0.551251,-0.536344,0.708,-0.713531,-0.36895,2008-06-26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,-0.445441,-0.394224,0.222755,0.198024,0.801118,0.936008,0.868492,-0.630727,0.918559,-0.341142,-0.012248,2008-04-10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,5,0.004914,-0.082158,0.184414,-0.012015,-0.340894,-0.317274,-0.227431,0.067703,-0.334439,0.21744,-0.725652,2008-01-11,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [7]:
# Our Y or Target is ArrDelay:
y = df_complete.ArrDelay
y = y.array # Extract the values as a pandas array (a PandasArray wrapper, not a plain numpy ndarray)
type(y)

pandas.core.arrays.numpy_.PandasArray

In [8]:
# Our X is every column in df_complete except ArrDelay and Date (the one-hot-encoded airline columns are kept)
X = df_complete.drop(columns=["ArrDelay","Date"])
feature_list = list(X.columns) # Saving feature names for later use
X = X.to_numpy() # Convert dataframe to array
type(X)

numpy.ndarray

# Level 1

## - Exercise 1 - Random Forest Regression

Take the dataset you prefer, build a pipeline and run a grid search applying the Random Forest algorithm.

We are going to use the Delayed Flights dataset, without feature DepDelay.

First, let's compute the R² score of a Random Forest model with n_estimators = 10 and otherwise default parameters.

In [9]:
# Instantiate model with 10 decision trees
model = RandomForestRegressor(n_estimators = 10, random_state = 42)
# Train the model on all the data
model.fit(X, y)
# Calculate R2 on the same data used for fitting (an optimistic estimate of the model's quality).
r_sq_train = model.score(X,y)
print('coefficient of determination with all data: %.3f' %r_sq_train)

coefficient of determination with all data: 0.988


In [10]:
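# Applying 10-fold Cross Validation (CV) with all data
accuracies = cross_val_score(estimator = model, X=X, y=y , cv = 10) # cv=10 folds (scikit-learn's default is 5); each score is an R2
# Note: "%3.f" prints the standard deviation with zero decimals; "%.3f" would show three decimals.
print("Random Forest Regression:\n Accuracy with train data: %.3f"%accuracies.mean(), "+/- %3.f"%accuracies.std(),"\n")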

Random Forest Regression:
 Accuracy with train data: 0.932 +/-   0 



The Random Forest model with n_estimators = 10 (the current scikit-learn default is 100) has a mean cross-validated R² of 0.932, which is quite good.  
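
For a regressor, the "accuracy" returned by `cross_val_score` is the coefficient of determination. The sketch below illustrates this on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins for the flights features). It compares the default scoring with an explicit `scoring="r2"` and prints the mean and spread with three decimals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical stand-in for the flights features)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=10, random_state=42)

# With no explicit scoring, cross_val_score uses the estimator's own score method,
# which for regressors is R^2
scores_default = cross_val_score(rf, X_demo, y_demo, cv=5)
scores_r2 = cross_val_score(rf, X_demo, y_demo, cv=5, scoring="r2")

print("Same scores:", np.allclose(scores_default, scores_r2))
# "%.3f" keeps three decimals for both the mean and the standard deviation
print("R^2: %.3f +/- %.3f" % (scores_default.mean(), scores_default.std()))
```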

Let's now check whether all the features are on a comparable scale.

In [11]:
df_X = pd.DataFrame(data=X,columns=feature_list)
df_X.head(5)

Unnamed: 0,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,ActualElapsedTime,CRSElapsedTime,AirTime,Distance,TaxiIn,TaxiOut,UniqueCarrier_9E,UniqueCarrier_AA,UniqueCarrier_AQ,UniqueCarrier_AS,UniqueCarrier_B6,UniqueCarrier_CO,UniqueCarrier_DL,UniqueCarrier_EV,UniqueCarrier_F9,UniqueCarrier_FL,UniqueCarrier_HA,UniqueCarrier_MQ,UniqueCarrier_NW,UniqueCarrier_OH,UniqueCarrier_OO,UniqueCarrier_UA,UniqueCarrier_US,UniqueCarrier_WN,UniqueCarrier_XE,UniqueCarrier_YV
0,2.0,1.396108,1.553827,1.265258,1.312513,-0.781188,-0.790736,-0.818653,-0.68307,-0.341142,0.130433,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,-0.699602,-0.997079,-0.540408,-0.920752,-0.90502,-0.832512,-0.804233,-0.783172,-0.154948,-0.654312,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.049503,0.125887,0.575125,0.600955,0.402102,0.490397,0.551251,0.708,-0.713531,-0.36895,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,4.0,-0.445441,-0.394224,0.222755,0.198024,0.801118,0.936008,0.868492,0.918559,-0.341142,-0.012248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,5.0,0.004914,-0.082158,0.184414,-0.012015,-0.340894,-0.317274,-0.227431,-0.334439,0.21744,-0.725652,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



As the feature "DayOfWeek" still needs to be standardized,  
we are going to build a pipeline that first standardizes it and then fits the model, and afterwards apply GridSearch    
to increase the cross-validation score by tuning hyperparameters.

The scikit-learn definition of the **Pipeline** class is:   
>*Sequentially apply a list of transforms and a final estimator.*  
>*Intermediate steps of pipeline must implement fit and transform methods and the final estimator only needs to implement fit.*    
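
To make the definition concrete, here is a minimal sketch on synthetic data (the step names `scale` and `forest` and the data are assumptions for illustration). Fitting the pipeline is equivalent to fitting the scaler, transforming the data, and then fitting the estimator on the transformed data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate the mechanics
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Pipeline: one transformer followed by a final estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(X_demo, y_demo)
pred_pipe = pipe.predict(X_demo)

# The same two steps done by hand
scaler = StandardScaler().fit(X_demo)
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(scaler.transform(X_demo), y_demo)
pred_manual = forest.predict(scaler.transform(X_demo))

print("Pipeline matches manual steps:", np.allclose(pred_pipe, pred_manual))
```

The practical benefit is that, under cross-validation or grid search, the scaler is re-fitted on each training fold only, so no information from the validation fold leaks into preprocessing.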



In [12]:
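# Pre-processing step
# Scale the data in the column DayOfWeek
# remainder='passthrough' keeps the other columns; the default ('drop') would discard them
pre_process = ColumnTransformer([('scale_data', StandardScaler(with_mean =True, with_std =True),['DayOfWeek'])], remainder='passthrough')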
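
One detail worth checking with `ColumnTransformer` is what happens to the columns that are not listed. By default (`remainder='drop'`) they are discarded, so a model placed after the transformer would only see the listed columns. A small sketch on a toy frame (the `toy` data is hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with one column to scale and one that is already scaled
toy = pd.DataFrame({"DayOfWeek": [1, 3, 5, 7], "Distance": [0.2, -0.4, 1.1, -0.9]})

# Default behaviour: unlisted columns are dropped
drop_rest = ColumnTransformer([("scale_data", StandardScaler(), ["DayOfWeek"])])

# remainder='passthrough': unlisted columns are appended unchanged
keep_rest = ColumnTransformer(
    [("scale_data", StandardScaler(), ["DayOfWeek"])], remainder="passthrough"
)

out_drop = drop_rest.fit_transform(toy)
out_keep = keep_rest.fit_transform(toy)
print(out_drop.shape)  # (4, 1): only the scaled DayOfWeek survives
print(out_keep.shape)  # (4, 2): scaled DayOfWeek plus the untouched Distance
```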

### Pipeline

In [13]:
# Define the Pipeline

"""
Step1: Pre-processing
Step2: Train a Random Forest Model 
"""
steps = [('std', pre_process ), ('random_forest', RandomForestRegressor(max_depth=10,random_state=2))]
model_pipeline = Pipeline(steps) # define the pipeline object.

# Fit the pipeline with the training data
model_pipeline.fit(df_X,y)

# Predict target values on the training data
y_train = model_pipeline.predict(df_X)

### Grid Search

In [14]:
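# GridSearch
# Parameter names must use the pipeline step name as prefix: '<step>__<parameter>' (here the step is 'random_forest')
tuned_parameters = {
    'random_forest__n_estimators': [5,10,15], # default is 100, but as our score with 10 is quite good (0.932), we use fewer estimators to increase speed.
    'random_forest__max_features': [1.0,'sqrt','log2'], # 1.0 (all features) replaces the deprecated 'auto', which meant the same for regressors
    'random_forest__bootstrap': [True,False], # assumption: swapped in for 'class_weight', which only RandomForestClassifier accepts
}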

hyper_tunning = GridSearchCV(estimator = model_pipeline, param_grid = tuned_parameters , cv =3, n_jobs = -1, verbose = 4)

hyper_tunning.fit(df_X,y)    

Fitting 3 folds for each of 18 candidates, totalling 54 fits
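
When a grid search wraps a pipeline, every key in the parameter grid has to follow the `<step name>__<parameter>` convention, and the parameter has to exist on that step's estimator. A quick way to check this is to list the names with `get_params()`. The sketch below does so on synthetic data; the data and the small grid are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate the naming convention
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

pipe = Pipeline([
    ("std", StandardScaler()),
    ("random_forest", RandomForestRegressor(random_state=2)),
])

# Valid grid keys for the forest step all start with "random_forest__"
forest_params = [name for name in pipe.get_params() if name.startswith("random_forest__")]
print("random_forest__n_estimators" in forest_params)
print("random_forest__class_weight" in forest_params)  # regressors have no class_weight

grid = {
    "random_forest__n_estimators": [5, 10],
    "random_forest__max_features": ["sqrt", 1.0],
}
search = GridSearchCV(pipe, param_grid=grid, cv=3)
search.fit(X_demo, y_demo)

print(len(search.cv_results_["params"]))  # 2 x 2 = 4 candidates
print(search.best_params_)
```

After fitting, `search.best_params_` and `search.best_score_` give the winning combination and its mean cross-validated R².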
