# AI 2024 Online Summer Internship
### Name: Rasikh Ali
### Email: rasikhali1234@gmail.com

## System & Libraries
<div class="alert alert-block alert-success">
    Using <b>Python </b> v3.8.18
    <br>
    with <b>Jupyter </b> v7.4.9
</div>
<div class="alert alert-block alert-info">
    
|    Libraries    | Version |     Purpose     |
|-----------------|---------|-----------------|
|Pandas           | v1.4.2  | Used for Data Manipulation and Analysis | 
|Numpy            | v1.23.5 | Used for Array Manipulation             |
|Pickle           | v4.0    | Used for Saving and Loading Model       |
|LabelEncoder     |         | Used for Encoding Categorical Features  |
|SVM              |         | Classifier for Classification, Regression, Outlier Detection  |
|Accuracy_Score   |         | Used for Calculating Accuracy Score of a Model                |
|Train_Test_Split |         | Used for Splitting array/matrices into train,test subsets     |
    
</div>

In [1]:
import pandas as pd
import numpy as np
import pickle

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split



## Loading Dataset

In [2]:
sample_data = pd.read_csv('dataset/train.csv')
sample_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Understanding Dataset

<div class="alert alert-block alert-info">
About <b>Dataset</b>.

| Attributes | Description  |
|------------|--------------|
|passengerId | Passenger's Id (in dataset, irrelevant)                        |
|survived    | Passenger Survival (0/1 = NotSurvived/Survived)                |
|pclass      | Passenger's Ticket class (1, 2, 3 = Upper, Middle, Lower)      |
|sex         | Passenger's Gender (male, female)                              |
|age         | Passenger's Age                                                |
|sibsp       | # of Siblings / Spouses onboard                                |
|parch       | # of Parents / Children onboard                                |
|ticket      | Passenger's Ticket Number                                      |
|fare        | Passenger's Fare                                               |
|cabin       | Passenger's Cabin Number                                       |
|embarked    | Port of Embarkation (C,Q,S = Cherbourg,Queenstown,Southampton) |
    
</div>


In [3]:
print("-- Attributes in Sample Data --")
for cols in sample_data.columns:
    print(cols)

-- Attributes in Sample Data --
PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked


In [4]:
print("-- Number of instances in Sample Data --")
print(sample_data.count())

-- Number of instances in Sample Data --
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64


In [5]:
print("-- Number of Unique Values in Sample Data --")
print(sample_data.nunique())

-- Number of Unique Values in Sample Data --
PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64


In [6]:
print("-- Number of Null Values in Sample Data --")
print(sample_data.isnull().sum())

-- Number of Null Values in Sample Data --
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [7]:
print("-- Insights of Sample Data --")
sample_data.info()

-- Insights of Sample Data --
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Pre-Processing

#### Managing Null Values

<div class="alert alert-block alert-success">
    As the Attribute: Age is <b>Numerical</b> and not Categorical, we'll fill its missing values with the Median (the middle value, which is less sensitive to outliers than the mean), then convert the column to int
</div>
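
To make the choice of the median concrete, here is a minimal sketch on a made-up column (the numbers are illustrative, not taken from the dataset). It shows how a single large value pulls the mean upward while the median stays near the bulk of the data.

```python
import pandas as pd

# Toy column with one missing value and one large outlier (made-up numbers).
ages = pd.Series([22.0, 38.0, 26.0, None, 80.0], name="Age")

median_age = ages.median()  # missing values are skipped by default
mean_age = ages.mean()      # pulled upward by the outlier (80)

# Fill the gap with the median, then cast to int as the notebook does.
ages_filled = ages.fillna(median_age).astype("int")

print("median:", median_age, "mean:", mean_age)
print(ages_filled.tolist())
```

The cast to int is only safe after the fill, because a column that still contains NaN cannot be converted to a plain integer dtype.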

In [8]:
sample_data['Age'] = sample_data['Age'].fillna(sample_data['Age'].median())

In [9]:
sample_data['Age'] = sample_data['Age'].astype('int')

In [10]:
print("-- No More Null Values in Age --")
print(sample_data.isnull().sum())

-- No More Null Values in Age --
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


<div class="alert alert-block alert-success">
    As the Attribute: Embarked is <b>Textual</b> and Categorical, we'll fill its missing values with the Mode (the most frequent value)
</div>
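
As a small illustration of mode imputation, the sketch below uses a made-up column of port codes. Note that `mode()` returns a Series (there can be ties), which is why the first element is selected.

```python
import pandas as pd

# Toy categorical column with one missing value (made-up data).
ports = pd.Series(["S", "C", None, "S", "Q"], name="Embarked")

# mode() ignores missing values and returns a Series; take its first entry.
most_frequent = ports.mode()[0]
ports_filled = ports.fillna(most_frequent)

print(most_frequent)
print(ports_filled.tolist())
```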

In [11]:
sample_data['Embarked'] = sample_data['Embarked'].fillna(sample_data['Embarked'].mode()[0])

In [12]:
print("-- No More Null Values in Embarked --")
print(sample_data.isnull().sum())

-- No More Null Values in Embarked --
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64


<div class="alert alert-block alert-danger">
About <b>77%</b> of Cabin's values are Null (687 of 891), so we'll drop that column 
</div>
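
One way to make this decision reproducible is to compute the fraction of missing values per column and drop anything above a cut-off. The sketch below assumes a hypothetical threshold of 0.5 and a made-up frame; neither comes from the notebook.

```python
import pandas as pd

# Made-up frame: Cabin is mostly missing, Age has one gap, Sex is complete.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() gives booleans; their mean is the fraction of missing values.
null_fraction = df.isnull().mean()

threshold = 0.5  # hypothetical cut-off, chosen for illustration
sparse_columns = null_fraction[null_fraction > threshold].index.tolist()

df_reduced = df.drop(columns=sparse_columns)

print(null_fraction)
print("dropped:", sparse_columns)
```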

In [13]:
sample_data = sample_data.drop(columns=['Cabin'])
sample_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,S


<div class="alert alert-block alert-danger">
Also dropping <b>Irrelevant</b> Columns
</div>

In [14]:
sample_data = sample_data.drop(columns=['PassengerId', 'Ticket', 'Fare', 'Name'])
sample_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked
0,0,3,male,22,1,0,S
1,1,1,female,38,1,0,C
2,1,3,female,26,0,0,S
3,1,1,female,35,1,0,S
4,0,3,male,35,0,0,S


## Label Encoding

<div class="alert alert-block alert-success">
    As the Attributes: <b>Sex</b> and <b>Embarked</b> are Textual, we'll need to encode them, because Scikit-learn estimators expect numerical input
    <br>
    Output (Survived) is already Numerical
</div>
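
It helps to know which integer each label receives. `LabelEncoder` sorts the classes it sees, so the codes follow alphabetical order regardless of the order in which labels are passed to `fit`. A minimal sketch with the three port codes:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["S", "C", "Q"])  # fit order does not matter

# classes_ holds the sorted labels; a label's position is its code.
classes = list(encoder.classes_)
codes = encoder.transform(["S", "C", "Q"]).tolist()

# inverse_transform maps codes back to the original labels.
decoded = encoder.inverse_transform([2, 0, 1]).tolist()

print(classes)
print(codes)
print(decoded)
```

By the same rule, `female` is encoded as 0 and `male` as 1, which matches the encoded rows shown later in the notebook.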

In [15]:
print("-- Unique Values in Attribute: Sex --")
print(sample_data.Sex.unique())

-- Unique Values in Attribute: Sex --
['male' 'female']


In [16]:
print("-- Unique Values in Attribute: Embarked --")
print(sample_data.Embarked.unique())

-- Unique Values in Attribute: Embarked --
['S' 'C' 'Q']


In [17]:
# Labels
sex = pd.DataFrame({'Sex':['male', 'female']})
embarked = pd.DataFrame({'Embarked':['S', 'C', 'Q']})


# Initializing Label Encoders
sex_label_encoder = LabelEncoder()
embarked_label_encoder = LabelEncoder()

# Training Label Encoder
sex_label_encoder.fit(np.ravel(sex))
embarked_label_encoder.fit(np.ravel(embarked))

In [18]:
sample_data_encoded = sample_data.copy()
sample_data_encoded_original = sample_data.copy()

In [19]:
# Transform Input Attributes into Numerical Representation
sample_data_encoded['Sex'] = sex_label_encoder.transform(sample_data['Sex']) 
sample_data_encoded['Embarked'] = embarked_label_encoder.transform(sample_data['Embarked']) 

In [20]:
# All the Attributes are Numerical 
sample_data_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Sex       891 non-null    int32
 3   Age       891 non-null    int32
 4   SibSp     891 non-null    int64
 5   Parch     891 non-null    int64
 6   Embarked  891 non-null    int32
dtypes: int32(3), int64(4)
memory usage: 38.4 KB


In [21]:
# Save the Transformed Features into CSV File 
sample_data_encoded.to_csv(r'sample-data-encoded.csv', index = False, header = True)

## Training Phase

#### Splitting data into train/test 

<div class="alert alert-block alert-info">
    Splitting data into train-test: Training = <b>80%</b> and Testing = <b>20%</b>.
</div>
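
The sketch below shows what an 80/20 split without shuffling looks like on a made-up ten-row frame: the first rows go to training and the last rows to testing, in their original order.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with ten rows, indexed 0..9.
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# With shuffle=False the split is a simple cut: first 80%, last 20%.
train_part, test_part = train_test_split(df, test_size=0.2, shuffle=False)

print(len(train_part), len(test_part))
print(test_part.index.tolist())
```

An ordered cut like this is reproducible without a seed, but it is only representative if the rows are not sorted in some meaningful way.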

In [22]:
# shuffle=False keeps the original row order, so random_state has no effect here
training_data, testing_data = train_test_split(sample_data_encoded, test_size=0.2, random_state=0, shuffle=False)

In [23]:
# Save the train and test data into CSV files
training_data.to_csv(r'training-data.csv', index = False, header = True)
testing_data.to_csv(r'testing-data.csv', index = False, header = True)

In [24]:
# Printing Training Data
print("-- Training Data --")
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(training_data)

-- Training Data --
     Survived  Pclass  Sex  Age  SibSp  Parch  Embarked
0           0       3    1   22      1      0         2
1           1       1    0   38      1      0         0
2           1       3    0   26      0      0         2
3           1       1    0   35      1      0         2
4           0       3    1   35      0      0         2
5           0       3    1   28      0      0         1
6           0       1    1   54      0      0         2
7           0       3    1    2      3      1         2
8           1       3    0   27      0      2         2
9           1       2    0   14      1      0         0
10          1       3    0    4      1      1         2
11          1       1    0   58      0      0         2
12          0       3    1   20      0      0         2
13          0       3    1   39      1      5         2
14          0       3    0   14      0      0         2
15          1       2    0   55      0      0         2
16          0       3    1  

In [25]:
# Printing Testing Data
print("-- Testing Data --")
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(testing_data)

-- Testing Data --
     Survived  Pclass  Sex  Age  SibSp  Parch  Embarked
712         1       1    1   48      1      0         2
713         0       3    1   29      0      0         2
714         0       2    1   52      0      0         2
715         0       3    1   19      0      0         2
716         1       1    0   38      0      0         0
717         1       2    0   27      0      0         2
718         0       3    1   28      0      0         1
719         0       3    1   33      0      0         2
720         1       2    0    6      0      1         2
721         0       3    1   17      1      0         2
722         0       2    1   34      0      0         2
723         0       2    1   50      0      0         2
724         1       1    1   27      1      0         2
725         0       3    1   20      0      0         2
726         1       2    0   30      3      0         2
727         1       3    0   28      0      0         1
728         0       2    1   

#### Splitting Input Vector and Output of Training Data

<div class="alert alert-block alert-info">
    Splitting <b>Input (x)</b> Vector and <b>Output (y)</b> of Training Data.
</div>

In [26]:
train_x = training_data.iloc[:, 1:]
train_x.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked
0,3,1,22,1,0,2
1,1,0,38,1,0,0
2,3,0,26,0,0,2
3,1,0,35,1,0,2
4,3,1,35,0,0,2


In [27]:
train_y = training_data.iloc[:, 0]
train_y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

#### Training Model Using Support Vector Classifier

In [28]:
print("-- Training using SVC on Training Data --")
print("-- Parameters & Values: ", end='')

model_svc = SVC(gamma='auto', random_state=0)
model_svc.fit(train_x, np.ravel(train_y))

print(model_svc)

-- Training using SVC on Training Data --
-- Parameters & Values: SVC(gamma='auto', random_state=0)


In [29]:
# Saving Trained Model
with open('model_svc.pkl', 'wb') as model_file:
    pickle.dump(model_svc, model_file)
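
The application phase also needs the fitted label encoders, not only the classifier. One option, sketched here under assumptions, is to pickle them together. The dictionary layout, the file name `bundle.pkl`, and the toy training data are all hypothetical and not part of the notebook.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Tiny made-up training set, only to have a fitted model to save.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
model = SVC(gamma="auto", random_state=0).fit(X, y)
encoder = LabelEncoder().fit(["male", "female"])

# Hypothetical layout: keep the model and its encoder in one object.
bundle = {"model": model, "sex_encoder": encoder}

path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")

# Context managers close the file even if pickling fails.
with open(path, "wb") as f:
    pickle.dump(bundle, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored["model"].predict(X) == model.predict(X)).all())
print(same_predictions, list(restored["sex_encoder"].classes_))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.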

## Testing Phase

#### Splitting Input Vector and Output of Testing Data

<div class="alert alert-block alert-info">
    Splitting <b>Input (x)</b> Vector and <b>Output (y)</b> of Testing Data.
</div>

In [30]:
test_x = testing_data.iloc[:, 1:]
test_x.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked
712,1,1,48,1,0,2
713,3,1,29,0,0,2
714,2,1,52,0,0,2
715,3,1,19,0,0,2
716,1,0,38,0,0,0


In [31]:
test_y = testing_data.iloc[:, 0]
test_y.head()

712    1
713    0
714    0
715    0
716    1
Name: Survived, dtype: int64

#### Loading Model

In [32]:
# Load saved Model
with open('model_svc.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

#### Evaluating Model

In [33]:
model_predictions = model.predict(test_x)

testing_data_prediction = testing_data.copy(deep=True)
pd.options.mode.chained_assignment = None

testing_data_prediction['Prediction'] = model_predictions

In [34]:
# Printing Testing Data
print("-- Testing Data with Prediction --")
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(testing_data_prediction)

-- Testing Data with Prediction --
     Survived  Pclass  Sex  Age  SibSp  Parch  Embarked  Prediction
712         1       1    1   48      1      0         2           0
713         0       3    1   29      0      0         2           0
714         0       2    1   52      0      0         2           0
715         0       3    1   19      0      0         2           0
716         1       1    0   38      0      0         0           1
717         1       2    0   27      0      0         2           1
718         0       3    1   28      0      0         1           0
719         0       3    1   33      0      0         2           0
720         1       2    0    6      0      1         2           1
721         0       3    1   17      1      0         2           0
722         0       2    1   34      0      0         2           0
723         0       2    1   50      0      0         2           0
724         1       1    1   27      1      0         2           0
725         0

In [35]:
# Saving Prediction into a CSV file
testing_data_prediction.to_csv(r'model_prediction.csv', index=False, header=True)

In [36]:
# Calculating Accuracy
model_accuracy_score = accuracy_score(testing_data_prediction['Survived'], testing_data_prediction['Prediction'])

print("-- Model Accuracy Score: ", end='')
print(round(model_accuracy_score,3))

-- Model Accuracy Score: 0.832
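
For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on five made-up label pairs by computing it by hand and with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: four of the five predictions are correct.
y_true = np.array([1, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 1])

# Comparing element-wise gives booleans; their mean is the accuracy.
manual_accuracy = (y_true == y_pred).mean()
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

Because accuracy counts all errors equally, it can look good on imbalanced data even when one class is predicted poorly.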


In [37]:
testing_data.Parch.unique()

array([0, 1, 3, 2, 5], dtype=int64)

## Evaluating on Unseen Data (Application Phase)

<div class="alert alert-block alert-info">
    Testing on Unseen Data (real-time user input).
</div>
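
Since `input()` always returns strings, raw user input needs validation and conversion before it reaches the model. The helper below is a hypothetical sketch, not part of the notebook: `build_feature_row` and the two code dictionaries are illustrative names, and the codes assume the alphabetical ordering that `LabelEncoder` produces (female=0, male=1; C=0, Q=1, S=2).

```python
import pandas as pd

# Assumed codes, mirroring LabelEncoder's alphabetical ordering.
SEX_CODES = {"female": 0, "male": 1}
EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}


def build_feature_row(pclass, sex, age, sibsp, parch, embarked):
    """Turn raw string inputs into a one-row numeric feature frame."""
    sex_key = sex.strip().lower()        # accept "Male", " male ", ...
    port_key = embarked.strip().upper()  # accept "s" as well as "S"
    if sex_key not in SEX_CODES:
        raise ValueError(f"Unknown sex: {sex!r}")
    if port_key not in EMBARKED_CODES:
        raise ValueError(f"Unknown port: {embarked!r}")
    # int() raises ValueError on non-numeric text, which is what we want.
    return pd.DataFrame({
        "Pclass": [int(pclass)],
        "Sex": [SEX_CODES[sex_key]],
        "Age": [int(age)],
        "SibSp": [int(sibsp)],
        "Parch": [int(parch)],
        "Embarked": [EMBARKED_CODES[port_key]],
    })


row = build_feature_row("2", " Male ", "23", "2", "1", "s")
print(row)
```

Keeping the column order identical to the training features matters, because the model was fitted on columns in that order.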

In [43]:
print("-- Take Input --")
pclass_inp   = input("-- Please Enter Pclass (1, 2, 3) :").strip()
# Lower-case the answer: the encoder was fitted on 'male' / 'female'
sex_inp      = input("-- Please Enter Gender (Male, Female) :").strip().lower()
age_inp      = input("-- Please Enter Age :").strip()
sibsp_inp    = input("-- Please Enter Number of Sibling/Spouse (0, 1, 2, 3) :").strip()
parch_inp    = input("-- Please Enter Number of Parent/Children (0, 1, 2, 3, 5) :").strip()
# Upper-case the answer: the encoder was fitted on 'C' / 'Q' / 'S'
embarked_inp = input("-- Please Enter Embarked (C, Q, S) :").strip().upper()

-- Take Input --
-- Please Enter Pclass (1, 2, 3) :2
-- Please Enter Gender (Male, Female) :male
-- Please Enter Age :23
-- Please Enter Number of Sibling/Spouse (0, 1, 2, 3) :2
-- Please Enter Number of Parent/Children (0, 1, 2, 3, 5) :1
-- Please Enter Embarked (C, Q, S) :S


In [44]:
# Convert Input into Feature Vector
user_inp = pd.DataFrame({
    'Pclass':   [pclass_inp],
    'Sex':      [sex_inp],
    'Age':      [age_inp],
    'SibSp':    [sibsp_inp],
    'Parch':    [parch_inp],
    'Embarked': [embarked_inp]
})

print("-- User Inputs are: ")
print(user_inp)

-- User Inputs are: 
  Pclass   Sex Age SibSp Parch Embarked
0      2  male  23     2     1        S


In [45]:
# Transform Input Into Numerical Representation
user_inp_features = user_inp.copy()

# input() returns strings, so cast the numeric fields to int
user_inp_features['Pclass']   = user_inp['Pclass'].astype(int)
user_inp_features['Sex']      = sex_label_encoder.transform(user_inp['Sex'])
user_inp_features['Age']      = user_inp['Age'].astype(int)
user_inp_features['SibSp']    = user_inp['SibSp'].astype(int)
user_inp_features['Parch']    = user_inp['Parch'].astype(int)
user_inp_features['Embarked'] = embarked_label_encoder.transform(user_inp['Embarked'])

print("-- User Input Encoded Feature Vector --")
print(user_inp_features)

-- User Input Encoded Feature Vector --
   Pclass  Sex  Age  SibSp  Parch  Embarked
0       2    1   23      2      1         2


In [46]:
# Loading Saved SVC Model
with open('model_svc.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

In [47]:
# Prediction on Unseen Data
# predict() returns an array; take the single prediction for the one input row
predicted_survival = model.predict(user_inp_features)[0]
prediction = 'Survived' if predicted_survival == 1 else 'Not Survived'


print("-- Prediction: ", end='')
print(prediction)

-- Prediction: Not Survived
