# Predicting with ML models

This notebook predicts which passengers survived the Titanic using a machine learning (ML) model, and measures how accurately the model identifies the outcome.

In [103]:
import pandas as pd

In [104]:
df = pd.read_csv("train.csv")

In [105]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Understanding the Data

The first step with any new data set is to understand what it contains. The call **df.info()** lists every column with its non-null count and data type, which shows where values are missing and which columns need converting before modelling.

---

## Data Exploration

In [107]:
df["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [108]:
df["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

The first output counts the deceased (0) and the survivors (1): 549 and 342. The second counts the male and female passengers: 577 and 314.

### Percentage of females who survived

In [109]:
round(df["PassengerId"][(df["Sex"]=='female')&(df['Survived']==1)].count()/(df["Sex"]=='female').sum(),2)*100

74.0

### Percentage of males who survived

In [110]:
round(df["PassengerId"][(df["Sex"]=='male')&(df['Survived']==1)].count()/(df["Sex"]=='male').sum(),2)*100

19.0
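
As a possible alternative to the two filtered counts above, both survival rates can be read off a single `groupby`. This is only a sketch on a small made-up frame (`toy` and its values are assumptions, not the Titanic data); it relies on the fact that the mean of a 0/1 column is the share of 1s.

```python
import pandas as pd

# Made-up stand-in for the Titanic frame, for illustration only.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate per group.
survival_pct = (toy.groupby("Sex")["Survived"].mean() * 100).round(1)
print(survival_pct)
```

On the real data the same expression would be applied to `df` instead of `toy`.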

---
## Data Transformation

Before a model can be trained, the data has to meet a few conditions:

- the data set must not contain null values
- the features must be numeric, because the model used here cannot work with text columns
- columns that carry no useful information should be removed

In [111]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The first step is to count the null values in each column. They have to be treated, otherwise the model cannot be trained on the data.
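
Raw counts are easier to judge next to the size of the data set. A minimal sketch, on a made-up frame (`toy` and its values are assumptions), of expressing the nulls as a share of the rows:

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values, for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives True/False per cell; the column mean is the fraction of nulls.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Columns with a large share of nulls are candidates for dropping, while columns with only a few can be filled in.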

### Removing irrelevant data 

In [112]:
df.drop(["Age", "Cabin", "PassengerId", "Ticket"], axis=1, inplace=True)

In [113]:
df

Unnamed: 0,Survived,Pclass,Name,Sex,SibSp,Parch,Fare,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,1,0,7.2500,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,71.2833,C
2,1,3,"Heikkinen, Miss. Laina",female,0,0,7.9250,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,53.1000,S
4,0,3,"Allen, Mr. William Henry",male,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",male,0,0,13.0000,S
887,1,1,"Graham, Miss. Margaret Edith",female,0,0,30.0000,S
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,23.4500,S
889,1,1,"Behr, Mr. Karl Howell",male,0,0,30.0000,C


After reviewing the data set, the first step is to drop the columns that are not used to predict the outcome. Age was dropped because 177 of its 891 values are missing. Cabin, PassengerId and Ticket were dropped because they were judged not to contribute to the outcome; Cabin is also missing for 687 passengers.

In [114]:
df.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

In [115]:
df["Embarked"].mode()[0]

'S'

In [116]:
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

After that, only Embarked still had missing values, and only 2 of them. Embarked records the port where each passenger boarded. Since the true port cannot be recovered, the two gaps were filled with the most frequent value in the column (the mode), which is 'S'.

---
### Encoding
"One Hot Encoding" is the process of converting a categorical variable into a set of binary columns, one per category, so that it can be passed to an ML algorithm. Each row gets a 1 in the column for its own category and a 0 in the others.

In [117]:
df = pd.get_dummies(df,columns=["Sex", "Embarked"])

In [118]:
df

Unnamed: 0,Survived,Pclass,Name,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,"Braund, Mr. Owen Harris",1,0,7.2500,0,1,0,0,1
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,0,71.2833,1,0,1,0,0
2,1,3,"Heikkinen, Miss. Laina",0,0,7.9250,1,0,0,0,1
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,0,53.1000,1,0,0,0,1
4,0,3,"Allen, Mr. William Henry",0,0,8.0500,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",0,0,13.0000,0,1,0,0,1
887,1,1,"Graham, Miss. Margaret Edith",0,0,30.0000,1,0,0,0,1
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,2,23.4500,1,0,0,0,1
889,1,1,"Behr, Mr. Karl Howell",0,0,30.0000,0,1,1,0,0


This line of code replaces the "Sex" and "Embarked" columns with binary columns. For a male passenger, **Sex_female** is 0 (false) and **Sex_male** is 1 (true).
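
The effect is easiest to see on a tiny example. The sketch below uses a made-up three-row frame (`toy` is an assumption); `dtype=int` is added here only so that the output shows 0/1 regardless of the pandas version, since newer versions return booleans by default.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# Each listed column is replaced by one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

Every row has exactly one 1 among the `Sex_*` columns and exactly one 1 among the `Embarked_*` columns.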

---
## Last Name Extraction

This step checks whether the passengers' last names have any relationship with the outcome. If most last names are unique, the column cannot group passengers in a meaningful way and carries little information for prediction.

In [119]:
def extract_lastname(x):
    return x.split(",")[0]

In [120]:
df["lastname"] = df["Name"].apply(extract_lastname)

In [121]:
df["lastname"].value_counts()

Andersson    9
Sage         7
Panula       6
Skoog        6
Carter       6
            ..
Hanna        1
Lewy         1
Mineff       1
Haas         1
Dooley       1
Name: lastname, Length: 667, dtype: int64

In [122]:
df.drop(["Name", "lastname"], axis=1, inplace=True)

There are 667 distinct last names among the 891 passengers, so most names occur only once. Both the Name and lastname columns were therefore dropped, as they did not contribute to the prediction.
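
One way to put a number on this is the ratio of distinct last names to rows; with 667 distinct names in 891 rows it is roughly 0.75. A minimal sketch on made-up names (the `names` series is an assumption, not taken from the data set), using the vectorised string methods instead of `apply`:

```python
import pandas as pd

# Made-up names in the same "Last, Title. First" layout, for illustration only.
names = pd.Series([
    "Smith, Mr. John",
    "Smith, Mrs. Jane",
    "Jones, Miss. Ann",
    "Brown, Mr. Tom",
])

# Everything before the first comma is the last name.
lastname = names.str.split(",").str[0]

# Share of distinct last names; values near 1 mean almost every name is unique.
distinct_ratio = lastname.nunique() / len(lastname)
print(distinct_ratio)
```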

In [123]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   SibSp       891 non-null    int64  
 3   Parch       891 non-null    int64  
 4   Fare        891 non-null    float64
 5   Sex_female  891 non-null    uint8  
 6   Sex_male    891 non-null    uint8  
 7   Embarked_C  891 non-null    uint8  
 8   Embarked_Q  891 non-null    uint8  
 9   Embarked_S  891 non-null    uint8  
dtypes: float64(1), int64(4), uint8(5)
memory usage: 39.3 KB


These are the final columns that will be used for prediction with the ML model.

---
## Variable Selection and Data Split
### Correlation

In [124]:
df[["Survived","Sex_female"]].corr()

Unnamed: 0,Survived,Sex_female
Survived,1.0,0.543351
Sex_female,0.543351,1.0


In [125]:
df.corr()

Unnamed: 0,Survived,Pclass,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
Survived,1.0,-0.338481,-0.035322,0.081629,0.257307,0.543351,-0.543351,0.16824,0.00365,-0.149683
Pclass,-0.338481,1.0,0.083081,0.018443,-0.5495,-0.1319,0.1319,-0.243292,0.221009,0.074053
SibSp,-0.035322,0.083081,1.0,0.414838,0.159651,0.114631,-0.114631,-0.059528,-0.026354,0.068734
Parch,0.081629,0.018443,0.414838,1.0,0.216225,0.245489,-0.245489,-0.011069,-0.081228,0.060814
Fare,0.257307,-0.5495,0.159651,0.216225,1.0,0.182333,-0.182333,0.269335,-0.117216,-0.162184
Sex_female,0.543351,-0.1319,0.114631,0.245489,0.182333,1.0,-1.0,0.082853,0.074115,-0.119224
Sex_male,-0.543351,0.1319,-0.114631,-0.245489,-0.182333,-1.0,1.0,-0.082853,-0.074115,0.119224
Embarked_C,0.16824,-0.243292,-0.059528,-0.011069,0.269335,0.082853,-0.082853,1.0,-0.148258,-0.782742
Embarked_Q,0.00365,0.221009,-0.026354,-0.081228,-0.117216,0.074115,-0.074115,-0.148258,1.0,-0.499421
Embarked_S,-0.149683,0.074053,0.068734,0.060814,-0.162184,-0.119224,0.119224,-0.782742,-0.499421,1.0


In [126]:
independent = ['Pclass','Fare', 'Sex_female','Embarked_C','Embarked_S']
#independent = ['Pclass','Fare', 'Sex_female']

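From this point the prediction process begins. The selected columns have the strongest correlations with Survived, in absolute value, of all the columns. Sex_male is left out because it is the exact mirror of Sex_female (their correlation is -1.0), so it adds no new information.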
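
The ranking can also be produced directly instead of being read off the matrix. The following is a sketch on made-up data (`toy` and its values are assumptions); it sorts the features by the absolute value of their correlation with the target, since a strong negative correlation is as useful as a strong positive one.

```python
import pandas as pd

# Made-up data with the same kind of columns, for illustration only.
toy = pd.DataFrame({
    "Survived":   [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex_female": [0, 1, 1, 1, 0, 0, 0, 1],
    "Sex_male":   [1, 0, 0, 0, 1, 1, 1, 0],
    "Pclass":     [3, 1, 3, 1, 3, 2, 1, 3],
})

# Correlation of every feature with the target, without the target itself.
corr_with_target = toy.corr()["Survived"].drop("Survived")

# Rank by strength, ignoring the sign.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Sex_female and Sex_male come out with the same strength and opposite signs, which is why only one of them needs to be kept.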

---
### Data split

In [152]:
from sklearn.model_selection import train_test_split 
i_train, i_test, d_train, d_test = train_test_split(df[independent], df["Survived"],test_size=0.3,random_state=100)

In [153]:
i_train

Unnamed: 0,Pclass,Fare,Sex_female,Embarked_C,Embarked_S
69,3,8.6625,0,0,1
85,3,15.8500,1,0,1
794,3,7.8958,0,0,1
161,2,15.7500,1,0,1
815,1,0.0000,0,0,1
...,...,...,...,...,...
855,3,9.3500,1,0,1
871,1,52.5542,1,0,1
835,1,83.1583,1,1,0
792,3,69.5500,1,0,1


In [129]:
d_train

69     0
85     1
794    0
161    1
815    0
      ..
855    1
871    1
835    1
792    0
520    1
Name: Survived, Length: 623, dtype: int64

In [130]:
i_test

Unnamed: 0,Pclass,Fare,Sex_female,Embarked_C,Embarked_S
205,3,10.4625,1,0,1
44,3,7.8792,1,0,0
821,3,8.6625,0,0,1
458,2,10.5000,1,0,1
795,2,13.0000,0,0,1
...,...,...,...,...,...
111,3,14.4542,1,1,0
730,1,211.3375,1,0,1
105,3,7.8958,0,0,1
479,3,12.2875,1,0,1


In [131]:
d_test

205    0
44     1
821    1
458    1
795    0
      ..
111    0
730    1
105    0
479    1
277    0
Name: Survived, Length: 268, dtype: int64

The idea is to train a model on the independent training set (i_train) together with its known outcomes (d_train), and then use it to predict outcomes for the test set (i_test). Comparing those predictions with the actual outcomes in d_test shows how accurate the model is on data it has not seen.
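
The sizes above (623 and 268 rows) follow from `test_size=0.3` applied to 891 rows. A small sketch on made-up data (`toy` is an assumption) shows the same mechanics: the rows are shuffled, split by the requested fraction, and the features and outcomes stay aligned through the index.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, one feature and one 0/1 target.
toy = pd.DataFrame({"feature": range(20), "target": [0, 1] * 10})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["feature"]], toy["target"], test_size=0.25, random_state=100
)

print(len(X_train), len(X_test))             # sizes of the two parts
print(X_train.index.equals(y_train.index))   # features and outcomes share the same rows
```

Fixing `random_state` makes the split reproducible, so the same rows end up in the test set on every run.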

---
## Build the Model

In [132]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(i_train,d_train)

LogisticRegression()

In [133]:
model.score(i_test,d_test)

0.7686567164179104

In [134]:
d_predict = model.predict(i_test)

In [135]:
d_predict

array([1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 0])

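This is the process of building the ML model. The model is fitted on the training data, and model.score reports that about 77% of the test-set outcomes are predicted correctly. The array above holds the predicted outcome for each passenger in the test set (1 for survived, 0 for deceased).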
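
For a classifier such as LogisticRegression, `score` is the accuracy: the share of predictions that match the true labels. The sketch below checks this on made-up data (`X`, `y` and `clf` are assumptions); it scores on the training data only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one feature, low values labelled 0 and high values labelled 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Accuracy by hand: fraction of predictions equal to the true label.
manual_accuracy = (clf.predict(X) == y).mean()
print(manual_accuracy, clf.score(X, y))
```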

In [136]:
from sklearn.metrics import confusion_matrix
tn_fp,fn_tp = confusion_matrix(d_test, d_predict)

print(f"True Negative: {tn_fp[0]:,}")
print(f"False Positive: {tn_fp[1]:,}")
print(f"False Negative: {fn_tp[0]:,}")
print(f"True Positive: {fn_tp[1]:,}")

True Negative: 134
False Positive: 25
False Negative: 37
True Positive: 72


Next comes the confusion matrix, which shows where the mislabelling happened. It splits the predictions into four categories:
- True Positives are the people who were predicted to survive and did survive.
- True Negatives are the people who were predicted to die and did die.
- False Positives are the people who were predicted to survive but actually died.
- False Negatives are the people who were predicted to die but actually survived.

In [137]:
from sklearn.metrics import classification_report
print(classification_report(d_test, d_predict))

              precision    recall  f1-score   support

           0       0.78      0.84      0.81       159
           1       0.74      0.66      0.70       109

    accuracy                           0.77       268
   macro avg       0.76      0.75      0.76       268
weighted avg       0.77      0.77      0.77       268
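
The figures in the report for class 1 (survived) can be reproduced from the four confusion-matrix counts printed earlier. This is plain arithmetic on those counts; only the variable names are introduced here.

```python
# Counts taken from the confusion matrix output above.
tn, fp, fn, tp = 134, 25, 37, 72

total = tn + fp + fn + tp          # number of passengers in the test set

accuracy = (tp + tn) / total       # share of all predictions that are correct
precision = tp / (tp + fp)         # of those predicted to survive, how many did
recall = tp / (tp + fn)            # of those who survived, how many were found

print(total, round(accuracy, 2), round(precision, 2), round(recall, 2))
```

The rounded values match the accuracy row and the precision and recall reported for class 1.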



## Packaging the Model
The final step is to package the trained model into a single file that can be reused for further predictions. Such a file can be plugged into web and mobile applications.

In [138]:
import pickle
with open("algorithm.pkl","wb") as file:
  pickle.dump(model,file)
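
A quick way to gain confidence in the packaged model is to check that a restored copy predicts the same as the original. The sketch below does the round trip in memory with `pickle.dumps` and `pickle.loads` on a made-up model (`X`, `y` and `clf` are assumptions); writing to a file works the same way.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data, for illustration only.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

# Serialise the fitted model to bytes and restore it again.
blob = pickle.dumps(clf)
restored = pickle.loads(blob)

# The restored model should give the same predictions as the original.
print((restored.predict(X) == clf.predict(X)).all())
```

Pickle files should only be loaded from sources you trust, because loading one can run arbitrary code.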

In [139]:
testdf = pd.read_csv("test.csv")

Here a new data set is imported that does not contain the outcome (Survived). Before it can be used for prediction, it has to go through the same data transformation and encoding steps as the training data.

---

- Get rid of unwanted columns
- Treat null values
- Encode strings as numbers
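
The three steps above could be collected into one function so that every new data set is prepared the same way. This is only a sketch: `prepare_features` is a hypothetical helper, the toy rows are made up, and the `reindex` call is an added safeguard that creates any dummy column missing from the new data (filled with 0) and puts the columns in the order the model expects.

```python
import pandas as pd

def prepare_features(raw, feature_columns, fare_fill):
    """Hypothetical helper: fill missing fares, encode, keep the model's columns."""
    out = raw.copy()
    out["Fare"] = out["Fare"].fillna(fare_fill)
    out = pd.get_dummies(out, columns=["Sex", "Embarked"], dtype=int)
    # Add any expected column that is absent here and drop everything else.
    return out.reindex(columns=feature_columns, fill_value=0)

features = ["Pclass", "Fare", "Sex_female", "Embarked_C", "Embarked_S"]

# Made-up rows: one missing fare and no passenger who embarked at C.
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Fare": [float("nan"), 80.0],
    "Sex": ["male", "female"],
    "Embarked": ["Q", "S"],
})

ready = prepare_features(toy, features, fare_fill=toy["Fare"].median())
print(ready)
```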

In [140]:
testdf.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [141]:
testdf.drop(["Age", "Cabin", "Ticket", "Name"], axis=1, inplace=True)

In [142]:
testdf.isnull().sum()

PassengerId    0
Pclass         0
Sex            0
SibSp          0
Parch          0
Fare           1
Embarked       0
dtype: int64

In [143]:
testdf["Fare"] = testdf["Fare"].fillna(testdf["Fare"].median())

These cells drop the columns that are not needed and fill the single missing Fare value with the median fare.

In [144]:
testdf = pd.get_dummies(testdf,columns=["Sex", "Embarked"])

In [145]:
tempdf = testdf[independent]

In [146]:
tempdf

Unnamed: 0,Pclass,Fare,Sex_female,Embarked_C,Embarked_S
0,3,7.8292,0,0,0
1,3,7.0000,1,0,1
2,2,9.6875,0,0,0
3,3,8.6625,0,0,1
4,3,12.2875,1,0,1
...,...,...,...,...,...
413,3,8.0500,0,0,1
414,1,108.9000,1,1,0
415,3,7.2500,0,0,1
416,3,8.0500,0,0,1


Here only the columns that the model was trained on are kept.

In [147]:
with open("algorithm.pkl",'rb') as file:
    model = pickle.load(file)

In [148]:
prediction = model.predict(tempdf)

In [149]:
testdf.insert(1,"Survived",prediction)

In [150]:
testdf

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,892,0,3,0,0,7.8292,0,1,0,1,0
1,893,1,3,1,0,7.0000,1,0,0,0,1
2,894,0,2,0,0,9.6875,0,1,0,1,0
3,895,0,3,0,0,8.6625,0,1,0,0,1
4,896,1,3,1,1,12.2875,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,0,0,8.0500,0,1,0,0,1
414,1306,1,1,0,0,108.9000,1,0,1,0,0
415,1307,0,3,0,0,7.2500,0,1,0,0,1
416,1308,0,3,0,0,8.0500,0,1,0,0,1


Finally, these are the predictions for the new data set. Only two columns are needed in the output file, PassengerId and Survived, so only those two are written out.

In [151]:
testdf[['PassengerId','Survived']].to_csv("my_predictions.csv",index=False)

---

# Summary

