# Basic train-test split exercise

In this notebook, you will explore the key concepts of the train-test split, a critical step for evaluating the generalization performance of supervised machine learning models. You will also build a simple pipeline and generate predictions to reinforce your understanding of this process.

For this exercise, we will use a data file called **spam**. This file contains statistical information about the emails received by an email server, along with a label indicating whether each email was classified as spam. You can download it from the following link: https://github.com/jnin/information-systems/raw/main/data/spam.csv

The data is in CSV format. The last column, named `spam`, denotes whether the email was considered spam (1) or not (0). The remaining columns contain information about the frequency of specific words, symbols, and text structures.

<div class="alert alert-info"><b>Exercise 1</b> 

Create a dataframe called ```df``` that contains the provided data. Extract the feature matrix and target array from ```df``` and store them in two new variables called ```X``` and ```y```, respectively.

</div>

In [2]:
import pandas as pd
import numpy as np


In [3]:

df = pd.read_csv('https://raw.githubusercontent.com/jnin/information-systems/main/data/spam.csv')
X = df.drop(columns='spam')
y = df['spam']
df


Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_hash,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4205,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4206,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4207,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4208,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


<div class="alert alert-info"><b>Exercise 2</b> 

Write code to display the column names of the feature matrix, check for any missing values, and use the pandas method `describe()` to summarize the features. Based on this summary, determine the most appropriate method for scaling the features.
</div>

In [4]:
X.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
word_freq_make,4210.0,0.104366,0.300005,0.0,0.0,0.0,0.0,4.54
word_freq_address,4210.0,0.112656,0.45426,0.0,0.0,0.0,0.0,14.28
word_freq_all,4210.0,0.291473,0.515719,0.0,0.0,0.0,0.44,5.1
word_freq_3d,4210.0,0.063078,1.352487,0.0,0.0,0.0,0.0,42.81
word_freq_our,4210.0,0.325321,0.687805,0.0,0.0,0.0,0.41,10.0
word_freq_over,4210.0,0.096656,0.27603,0.0,0.0,0.0,0.0,5.88
word_freq_remove,4210.0,0.117475,0.397284,0.0,0.0,0.0,0.0,7.27
word_freq_internet,4210.0,0.108,0.410282,0.0,0.0,0.0,0.0,11.11
word_freq_order,4210.0,0.09186,0.282144,0.0,0.0,0.0,0.0,5.26
word_freq_mail,4210.0,0.24842,0.656638,0.0,0.0,0.0,0.19,18.18


In [5]:
pd.set_option('display.max_columns',57,'display.width',170)
print(X.columns)
print(X.isna().sum())
print(X.describe())

Index(['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our', 'word_freq_over', 'word_freq_remove', 'word_freq_internet',
       'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free',
       'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money',
       'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet', 'word_freq_857', 'word_freq_data',
       'word_freq_415', 'word_freq_85', 'word_freq_technology', 'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct', 'word_freq_cs',
       'word_freq_meeting', 'word_freq_original', 'word_freq_project', 'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference', 'char_freq_;',
       'char_freq_(', 'char_freq_[', 'cha

In [6]:
X.eq(0).sum()

word_freq_make                3228
word_freq_address             3399
word_freq_all                 2426
word_freq_3d                  4164
word_freq_our                 2551
word_freq_over                3277
word_freq_remove              3448
word_freq_internet            3443
word_freq_order               3488
word_freq_mail                2981
word_freq_receive             3559
word_freq_will                2009
word_freq_people              3404
word_freq_report              3874
word_freq_addresses           3905
word_freq_free                3054
word_freq_business            3304
word_freq_email               3238
word_freq_you                 1146
word_freq_credit              3821
word_freq_your                1925
word_freq_font                4098
word_freq_000                 3591
word_freq_money               3551
word_freq_hp                  3146
word_freq_hpl                 3421
word_freq_george              3543
word_freq_650                 3758
word_freq_lab       
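
The zero counts above suggest that most features are heavily right-skewed: the median is 0 while the maximum is far larger. A minimal sketch of how that could be quantified per column is shown below. It uses a small synthetic dataframe (`demo`, with hypothetical column names) as a stand-in for `X`, so the numbers it prints are not those of the spam data, and it only reports diagnostics rather than prescribing a scaler.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature matrix (an assumption for illustration):
# one column that is mostly zeros with a long right tail, one roughly normal column.
rng = np.random.default_rng(0)
n_rows = 1000
demo = pd.DataFrame({
    "sparse_feature": rng.exponential(scale=1.0, size=n_rows) * (rng.random(n_rows) < 0.2),
    "dense_feature": rng.normal(loc=5.0, scale=1.0, size=n_rows),
})

summary = pd.DataFrame({
    "zero_fraction": demo.eq(0).mean(),  # share of exact zeros per column
    "skewness": demo.skew(),             # positive values indicate a long right tail
    "median": demo.median(),
    "max": demo.max(),
})
print(summary)
```

Running the same three calls on `X` (`X.eq(0).mean()`, `X.skew()`, `X.median()`) would give a compact view of how far each feature departs from a symmetric distribution.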

<div class="alert alert-info"><b>Exercise 3</b> 

Write code to display the distribution of labels in the target array `y`.

</div>

In [7]:
y.value_counts()

spam
0    2531
1    1679
Name: count, dtype: int64
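
Raw counts are easier to interpret as proportions. The sketch below rebuilds a target with the same class counts as the output above (the series `y_demo` is a stand-in, not the real `y`) and derives the accuracy of a classifier that always predicts the majority class, which is a useful floor when reading the scores in the later exercises.

```python
import pandas as pd

# Stand-in target with the class counts reported above: 2531 non-spam, 1679 spam.
y_demo = pd.Series([0] * 2531 + [1] * 1679, name="spam")

# normalize=True returns relative frequencies instead of counts.
proportions = y_demo.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = proportions.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

On the real data, `y.value_counts(normalize=True)` gives the same proportions directly.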

<div class="alert alert-info"><b>Exercise 4</b> 
    
Write code to normalize the data using an appropriate scaler and train a `LogisticRegression` model. To do this, create a pipeline containing both steps. After fitting the model, evaluate its performance by checking the `score` of the trained model.

</div>

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([('scaler', StandardScaler()), ('log_model', LogisticRegression())])
pipe.fit(X, y)
# Note: this score is computed on the same data that was used for fitting
print(f"Logistic regression accuracy: {pipe.score(X, y)}")

Logistic regression accuracy: 0.9254156769596199


<div class="alert alert-info"><b>Exercise 5</b> 
    
Now, create a second pipeline, this time using a `RandomForestClassifier`. After training the model, evaluate its performance by checking the score.

</div>
<div class="alert alert-warning">

In the following weeks, we will dive deeper into how the Random Forest classifier works. For today, simply compute its score and draw some initial conclusions about its performance.

</div>

In [16]:
from sklearn.ensemble import RandomForestClassifier

pipe_forest=Pipeline([('scaler',StandardScaler()),('Random',RandomForestClassifier(random_state=42))])
pipe_forest.fit(X,y)
print(f"Random forest accuracy: {pipe_forest.score(X,y)}")

Random forest accuracy: 0.9992874109263657


<div class="alert alert-info"><b>Exercise 6</b> 
    
Write the code to split ```X``` and ```y``` into a training set and a test set using the sklearn library. Use the common names ```X_train, X_test, y_train, y_test```. To ensure reproducibility, pass `random_state=42` when calling the `train_test_split` function.

</div>

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42)
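
When `test_size` is not given, `train_test_split` holds out 25% of the rows. The sketch below checks the resulting sizes on placeholder arrays (`X_demo` and `y_demo` are assumptions that only mimic the row count and class counts of the spam data). It also shows the optional `stratify` argument, which the exercise does not require but which keeps the class ratio similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows and class counts as the spam dataset.
X_demo = np.arange(4210).reshape(-1, 1)
y_demo = np.array([0] * 2531 + [1] * 1679)

# Default behaviour: test_size=0.25, rows shuffled, no stratification.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))

# Optional: stratify keeps the spam share nearly identical in train and test.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)
print(f"spam share, train: {y_tr_s.mean():.3f}, test: {y_te_s.mean():.3f}")
```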

<div class="alert alert-info"><b>Exercise 7</b> 

Finally, fit both pipelines on the training data only and make predictions on the test data. Analyze the results and explain why the accuracies are lower in this case.

</div>

In [18]:
pipe.fit(X_train, y_train)
y_predict = pipe.predict(X_test)
pipe_score = pipe.score(X_test, y_test)

pipe_forest.fit(X_train, y_train)
y_predict_forest = pipe_forest.predict(X_test)
pipe_forest_score = pipe_forest.score(X_test, y_test)

print(f"Log model accuracy: {pipe_score} \nForest model accuracy: {pipe_forest_score}")

Log model accuracy: 0.912630579297246 
Forest model accuracy: 0.9430199430199431
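
The scores in Exercises 4 and 5 were computed on the same rows used for fitting, so they measure how well each model reproduces data it has already seen. The test scores come from held-out rows and are a more honest estimate of generalization. The drop is small for logistic regression (0.925 to 0.913) and larger for the random forest (0.999 to 0.943), which indicates that the forest had largely memorized the training set. The sketch below reproduces this pattern on synthetic data generated with `make_classification`; the dataset and its parameters are assumptions for illustration, not the spam data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative assumption); flip_y adds label noise,
# so memorizing the training labels does not carry over to new rows.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
train_acc = forest.score(X_tr, y_tr)  # rows the model was fitted on
test_acc = forest.score(X_te, y_te)   # held-out rows
print(f"train: {train_acc:.3f}, test: {test_acc:.3f}, gap: {train_acc - test_acc:.3f}")
```

A large gap between the two numbers is the usual sign of overfitting; only the test score should be used to compare models.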
