**Interpromo Project 2022**

**Group 5: Generating new observations**

**Preparatory work: Data Preprocessing and Classification**

This is a simple tutorial for Data Preprocessing before model training. 

We again use the Titanic dataset to walk through a simple example of a machine learning algorithm for classification. I chose this example because our group will have to build a classification algorithm able to detect whether data is fake or real.

This is only a simple example: there are other ways to preprocess the data and other machine learning algorithms for classification. You can try to improve the results by changing the preprocessing method and/or using other algorithms, with the help of the documentation of the [scikit-learn library](https://scikit-learn.org).

# Data Preprocessing

**Preprocessing :**


Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.

## Importing the libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

## Importing the dataset


In [2]:
df = pd.read_csv('titanic.csv')

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Taking care of missing data

In [4]:
# Create table for missing data analysis
def missing_data_table(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum() /
               df.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent],
                             axis=1,
                             keys=['Total', 'Percent'])
    return missing_data

In [5]:
missing_data_table(df)

Unnamed: 0,Total,Percent
Cabin,687,0.771044
Age,177,0.198653
Embarked,2,0.002245
Fare,0,0.0
Ticket,0,0.0
Parch,0,0.0
SibSp,0,0.0
Sex,0,0.0
Name,0,0.0
Pclass,0,0.0


In [6]:
# Drop Cabin: about 77% of its values are missing
df.drop(columns='Cabin', inplace=True)
# Drop columns that are not used as features here
df.drop(columns=['PassengerId', 'Name', 'Ticket'], inplace=True)

Here I chose to delete some columns, assuming that they do not affect whether or not a passenger survives.

One possible improvement would be to create new variables from them (feature engineering).

Example: the *Name* feature contains titles such as 'Mr', 'Mrs', 'Miss', 'Master' and 'Dr', which can be useful.
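As a minimal sketch of this idea, the title can be pulled out of each name with a regular expression. The snippet assumes the "Surname, Title. First names" layout seen in the `df.head()` output and would have to run before the *Name* column is dropped. The `Title` column name and the last sample row are hypothetical, added only for illustration.

```python
import pandas as pd

# Toy sample in the same "Surname, Title. First names" layout as the Name column.
# The last row is a made-up name used only to illustrate the 'Master' title.
names = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Heikkinen, Miss. Laina",
        "Doe, Master. John",
    ]
})

# Capture the text between the comma and the first period, e.g. "Mr" or "Miss"
names["Title"] = (
    names["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

print(names["Title"].tolist())
```

The resulting column is categorical, so it would need the same encoding step as *Sex* or *Embarked* before training.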

In [7]:
missing_data_table(df)

Unnamed: 0,Total,Percent
Age,177,0.198653
Embarked,2,0.002245
Fare,0,0.0
Parch,0,0.0
SibSp,0,0.0
Sex,0,0.0
Pclass,0,0.0
Survived,0,0.0


In [8]:
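# Replace missing ages with the mean age
# (assigning the result back avoids chained in-place calls, which recent pandas versions warn about)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Replace missing ports of embarkation with the mode (most frequent value)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])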

Missing values can also be replaced using other methods, for example by filling a missing age with the mean age of the passenger's class rather than the overall mean.
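A possible sketch of class-based imputation is shown here on a toy table rather than on the Titanic data. The `toy` frame and its values are assumptions made for illustration; only the `Pclass` and `Age` column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: two passenger classes, one missing age in each
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Mean age within each class, aligned back to the original rows
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# Fill each missing age with the mean of its own class
toy["Age"] = toy["Age"].fillna(class_mean_age)

print(toy)
```

On the real data this would need to run while *Pclass* is still a single column, that is, before the one-hot encoding step.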

In [9]:
missing_data_table(df)

Unnamed: 0,Total,Percent
Embarked,0,0.0
Fare,0,0.0
Parch,0,0.0
SibSp,0,0.0
Age,0,0.0
Sex,0,0.0
Pclass,0,0.0
Survived,0,0.0


## Encoding categorical data

Many machine learning algorithms cannot handle categorical values unless they are first converted to numerical values.
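To see what this conversion does on a small scale, here is a sketch of one-hot encoding on a toy table. The `toy` frame is an assumption made for illustration; each category becomes its own 0/1 indicator column.

```python
import pandas as pd

# Toy data with two categorical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the columns derived from the same original feature.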

In [10]:
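# Convert categorical columns to 0/1 indicator columns (one-hot encoding)
# dtype=int keeps integer 0/1 values on recent pandas versions, which default to booleans
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'], dtype=int)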

In [11]:
df.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,1,0,7.25,0,1,0,0,1,0,0,1
1,1,38.0,1,0,71.2833,1,0,1,0,0,1,0,0
2,1,26.0,0,0,7.925,1,0,0,0,1,0,0,1
3,1,35.0,1,0,53.1,1,0,1,0,0,0,0,1
4,0,35.0,0,0,8.05,0,1,0,0,1,0,0,1


Now our data is free of missing values, categorical data and unwanted columns and is ready to be used to train our model.

## Splitting the dataset into the Training set and Test set

Next we separate the features X from the target y and split the dataset into an 80% training set and a 20% test set using [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split).

In [12]:
X = df.drop(columns='Survived')
y = df['Survived'] # variable to be predicted 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=41)
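As a quick sanity check of what the split produces, the sketch below runs the same call on toy data and prints the set sizes. It also passes `stratify`, an optional argument not used in the cell above, which keeps the class proportions the same in both sets. The toy data and variable names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 40% of them in the positive class
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=41,    # fixed seed so the split is reproducible
    stratify=y_toy)     # keep the 40/60 class ratio in both sets

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```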

## Feature Scaling

Scaling is not mandatory, but some machine learning algorithms perform better on scaled data. This mainly concerns distance-based or gradient-based methods such as KNN, SVM or logistic regression; tree-based models such as Random Forest are largely insensitive to feature scale.

Here we will use [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler); if you wish to use another method, look at [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).

In [13]:
X_train.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
628,26.0,0,0,7.8958,0,1,0,0,1,0,0,1
300,29.699118,0,0,7.75,1,0,0,0,1,0,1,0
663,36.0,0,0,7.4958,0,1,0,0,1,0,0,1
50,7.0,4,1,39.6875,0,1,0,0,1,0,0,1
846,29.699118,8,2,69.55,0,1,0,0,1,0,0,1


In [14]:
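from sklearn.preprocessing import StandardScaler

num_cols = ['Age', 'SibSp', 'Parch', 'Fare']  # only the numeric features are scaled

# Work on explicit copies to avoid pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

sc = StandardScaler()
X_train[num_cols] = sc.fit_transform(X_train[num_cols])  # learn mean/std on train, then scale
X_test[num_cols] = sc.transform(X_test[num_cols])  # reuse the mean/std learned on train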

We call fit_transform() on the training data to learn the scaling parameters (the mean and standard deviation of each column) and scale the training data in one step. On the test data we call only transform(), which applies the parameters learned on the training data.

This is the standard procedure: always learn the scaling parameters on the training set, then apply them to the test set, so that no information from the test set leaks into the preprocessing.
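The procedure above can be sketched on toy numbers to show what the scaler stores. `StandardScaler` computes z = (x - mean) / std, where the mean and standard deviation come from the data passed to `fit`. The arrays below are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = sc.transform(test)        # reuses the train statistics

# Same computation by hand, using the statistics learned on the training data
manual = (test - sc.mean_) / sc.scale_

print("train mean:", sc.mean_, "train std:", sc.scale_)
print("scaled test value:", test_scaled, "manual:", manual)
```

The test value is scaled with the training mean (20) rather than its own, which is why scaled test columns do not necessarily have mean 0.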

In [15]:
X_train.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
628,-0.278119,-0.474848,-0.474065,-0.518903,0,1,0,0,1,0,0,1
300,0.000957,-0.474848,-0.474065,-0.522131,1,0,0,0,1,0,1,0
663,0.47632,-0.474848,-0.474065,-0.527759,0,1,0,0,1,0,0,1
50,-1.711552,2.983891,0.718635,0.184907,0,1,0,0,1,0,0,1
846,0.000957,6.44263,1.911335,0.846008,0,1,0,0,1,0,0,1


# Training

There are different classification methods, such as:

- KNN
- Random Forest Classifier
- Logistic Regression
- SVM
- Decision Tree Classifier
- ...
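The methods listed above share the same `fit`/`predict` interface in scikit-learn, so they can be compared in a loop. The sketch below is one possible way to do that, assuming synthetic data from `make_classification` as a stand-in for the preprocessed features and 5-fold cross-validation as the comparison criterion; neither is part of the workflow in this notebook, and the scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features (assumption, not Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=6, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

To try this on the Titanic data, `X_demo` and `y_demo` would be replaced by the training features and labels.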

In this example I decided to use [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier)

In [16]:
random_forest = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest
    max_depth=6)  # maximum depth of each tree; see the documentation for other parameters
random_forest.fit(X_train, y_train)  # training on the train set
Y_pred = random_forest.predict(X_test)  # prediction on the test set

In [17]:
cm = confusion_matrix(y_test, Y_pred)
print(cm)

[[101   4]
 [ 26  48]]
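In scikit-learn's convention the rows of this matrix are the true classes and the columns are the predicted classes, so it reads as [[TN, FP], [FN, TP]]. As a small worked example, the usual metrics can be derived from the four counts printed above. The accuracy obtained this way, 149/179, is the same quantity that `score` reports on the test set.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[101, 4],
               [26, 48]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all passengers classified correctly
precision = tp / (tp + fp)       # of predicted survivors, share who actually survived
recall = tp / (tp + fn)          # of actual survivors, share the model found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Here the model rarely predicts survival wrongly (4 false positives) but misses 26 of the 74 actual survivors, which accuracy alone does not show.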


In [18]:
print(' Train accuracy : ', random_forest.score(X_train, y_train))
print(' Test accuracy : ', random_forest.score(X_test, y_test))

 Train accuracy :  0.8623595505617978
 Test accuracy :  0.8324022346368715


**Try to improve my results:**