# Data Mining project
## Members:
- Mateusz Idziejczak 155842
- Mateusz Stawicki 155900
- Anastasiya
- Baris
## Dataset
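The dataset we are going to use is the [Titanic](https://www.kaggle.com/competitions/titanic) dataset from Kaggle. The Titanic carried roughly 2224 passengers and crew; the training file we use (`data/train.csv`) contains 891 of the passengers. It is in CSV format and contains 12 columns: `PassengerId`, `Survived`, `Pclass`, `Name`, `Sex`, `Age`, `SibSp`, `Parch`, `Ticket`, `Fare`, `Cabin` and `Embarked`.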

## Implementation
### Data loading
First we need to load the data from the CSV file. We will use the pandas library to do this.


In [510]:
import pandas as pd

data = pd.read_csv('data/train.csv')

### Data exploration
Now we will explore the data to see what it contains.

In [511]:
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [512]:
data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


### Data preprocessing
Before we can start building the model we need to preprocess the data. This includes:
- Splitting the data into features and labels
- Removing columns that are not useful
- Handling missing values
- Encoding categorical variables
- Scaling the data
- Splitting the data into training and testing sets

#### Splitting the data into features and labels

In [513]:
X = data.drop('Survived', axis=1)
y = data['Survived']

#### Removing columns that are not useful
We will remove the `PassengerId`, `Name`, `Ticket` and `Cabin` columns, as we do not use them in the model.

In [514]:
X = X.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

#### Handling missing values
We will fill the missing values in the `Age` and `Embarked` columns with the mean and the mode, respectively.

In [515]:
X.isna().sum().to_frame().reset_index().rename(columns={0: 'Missing values'})

| | index | Missing values |
|---|---|---|
| 0 | Pclass | 0 |
| 1 | Sex | 0 |
| 2 | Age | 177 |
| 3 | SibSp | 0 |
| 4 | Parch | 0 |
| 5 | Fare | 0 |
| 6 | Embarked | 2 |


In [516]:
X['Age'] = X['Age'].fillna(X['Age'].mean())
X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])
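The two fill strategies can be illustrated on a small made-up frame. This is only a sketch: the `toy` frame and its values are hypothetical and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, only to illustrate the two strategies.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['S', None, 'S'],
})

# Numeric column: replace NaN with the column mean (mean() skips NaN).
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Categorical column: replace missing entries with the most frequent value.
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(toy)
```

After the fill, the missing age becomes 30.0 (the mean of 20 and 40) and the missing port becomes `S`.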

#### Encoding categorical variables
We will encode the `Embarked` and `Sex` columns as integers using label encoding (`LabelEncoder`). The one-hot variant (`pd.get_dummies`) is left commented out.

In [517]:
columns_to_encode = ['Embarked', 'Sex']
try:
    # X = pd.get_dummies(X, columns=columns_to_encode)
    from sklearn.preprocessing import LabelEncoder
    
    label_encoder = LabelEncoder()
    for column in columns_to_encode:
        X[column] = label_encoder.fit_transform(X[column])
except KeyError:
    print('Columns already encoded')
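To make the difference between the two encodings concrete, here is a minimal sketch on a hypothetical four-row `toy` frame. `LabelEncoder` assigns integers in sorted order of the labels, while `pd.get_dummies` creates one indicator column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of embarkation ports.
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Label encoding: one integer per category, in sorted order of the labels.
encoder = LabelEncoder()
labels = encoder.fit_transform(toy['Embarked'])
print(list(encoder.classes_), labels.tolist())  # ['C', 'Q', 'S'] [2, 0, 1, 2]

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy, columns=['Embarked'])
print(one_hot.columns.tolist())
```

Label encoding keeps the feature count low but implies an order (`C` < `Q` < `S`) that has no real meaning. Tree-based models are usually not bothered by this, whereas a linear model such as the Perceptron may be.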

We will also split `Age` into 6 age groups. First we check how the survival rate depends on age.

In [518]:
df_X = X.copy()
df_X['Survived'] = y
bins = [0, 9, 14, 42, 57, 59, 100]
df_X['AgeBin'] = pd.cut(df_X['Age'], bins=bins, labels=[0, 1, 2, 3, 4, 5])
df_X.groupby(['AgeBin', 'Survived']).size().unstack().apply(lambda x: x / x.sum(), axis=1)


| AgeBin | Survived = 0 | Survived = 1 |
|---|---|---|
| 0 | 0.387097 | 0.612903 |
| 1 | 0.533333 | 0.466667 |
| 2 | 0.635036 | 0.364964 |
| 3 | 0.614583 | 0.385417 |
| 4 | 0.571429 | 0.428571 |
| 5 | 0.730769 | 0.269231 |


In [519]:
def age_category(age):
    if age <= bins[1]:
        return 0
    elif age <= bins[2]:
        return 1
    elif age <= bins[3]:
        return 2
    elif age <= bins[4]:
        return 3
    elif age <= bins[5]:
        return 4
    else:
        return 5

X['Age'] = X['Age'].apply(age_category)
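For ages inside the bin range, the same mapping can be obtained directly from `pd.cut`. The sketch below uses the bin edges from above on a few hypothetical ages; `labels=False` returns the integer bin index instead of an interval.

```python
import pandas as pd

bins = [0, 9, 14, 42, 57, 59, 100]
# Hypothetical ages chosen to hit every bin, including the edges.
ages = pd.Series([0.42, 9.0, 14.0, 29.7, 57.0, 58.0, 80.0])

# right=True (the default) closes each interval on the right, e.g. (0, 9],
# which matches the `age <= bins[k]` comparisons in age_category.
age_group = pd.cut(ages, bins=bins, labels=False)
print(age_group.tolist())  # [0, 0, 1, 2, 3, 4, 5]
```

One difference to keep in mind: `pd.cut` returns `NaN` for values outside `(0, 100]`, while `age_category` assigns them to the first or last group.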

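We also compute the family size as `SibSp + Parch + 1` (the passenger plus any siblings, spouses, parents and children aboard) and drop the two original columns.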

In [520]:
X['FamilySize'] = X['SibSp'] + X['Parch'] + 1
X = X.drop(['SibSp', 'Parch'], axis=1)

#### Scaling the data
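We will scale the `Fare` column using the `Normalizer` from `sklearn` with `norm='max'` (the L1 variant applied to all columns is left commented out). Note that `Normalizer` rescales each row rather than each column, so when it is applied to a single column it maps every non-zero fare to 1.0. This is why `Fare` shows 1.0 throughout the final table.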

In [521]:
from sklearn.preprocessing import Normalizer

X1 = X[['Fare']]
X2 = X.drop(['Fare'], axis=1)

# scaler = Normalizer(norm='l1').fit(X)
# X = pd.DataFrame(scaler.transform(X), columns=X.columns)
scaler = Normalizer(norm='max').fit(X1)
X1 = pd.DataFrame(scaler.transform(X1), columns=X1.columns)
X = pd.concat([X1, X2], axis=1)
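The row-wise behaviour of `Normalizer` is easy to see on three fares taken from the tables above. The sketch also shows `MinMaxScaler` as one possible column-wise alternative; using it here is our suggestion and not part of the pipeline above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Three fares from the tables above, as a single-column matrix.
fare = np.array([[7.25], [71.2833], [512.3292]])

# Normalizer works per row: each one-element row is divided by its own maximum.
row_wise = Normalizer(norm='max').fit_transform(fare)

# MinMaxScaler works per column: (x - min) / (max - min).
column_wise = MinMaxScaler().fit_transform(fare)

print(row_wise.ravel())     # every value is 1.0
print(column_wise.ravel())  # values spread between 0 and 1
```

With a column-wise scaler the relative differences between fares are preserved, so the feature can still carry information.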

#### Splitting the data into training and testing sets
We will split the data into training and testing sets using the `train_test_split` method from the `sklearn` library.

In [522]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
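The split above is not seeded, so the reported scores change from run to run. A minimal sketch of a reproducible, stratified split follows; the `X_demo` and `y_demo` arrays are hypothetical stand-ins, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 38% positive, roughly like the survival rate.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 38 + [0] * 62)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_demo,   # keep the class ratio similar in both parts
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```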

#### Final data
The final data looks like this:

In [523]:
X.head(10)

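| | Fare | Pclass | Sex | Age | Embarked | FamilySize |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 3 | 1 | 2 | 2 | 2 |
| 1 | 1.0 | 1 | 0 | 2 | 0 | 2 |
| 2 | 1.0 | 3 | 0 | 2 | 2 | 1 |
| 3 | 1.0 | 1 | 0 | 2 | 2 | 2 |
| 4 | 1.0 | 3 | 1 | 2 | 2 | 1 |
| 5 | 1.0 | 3 | 1 | 2 | 1 | 1 |
| 6 | 1.0 | 1 | 1 | 3 | 2 | 1 |
| 7 | 1.0 | 3 | 1 | 0 | 2 | 5 |
| 8 | 1.0 | 3 | 0 | 2 | 2 | 3 |
| 9 | 1.0 | 2 | 0 | 1 | 0 | 2 |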

### Selecting features
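Now we will score the features to see which ones are the most informative. We use the `SelectKBest` method from the `sklearn` library with the chi-squared (`chi2`) score function. Note that the cell below only ranks the features; the models in the next section are still trained on all six columns.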

In [524]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

best_features = SelectKBest(score_func=chi2, k=5)
fit = best_features.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
feature_scores = pd.concat([dfcolumns, dfscores], axis=1)
feature_scores.columns = ['Feature', 'Score']
feature_scores = feature_scores.sort_values(by='Score', ascending=False)
feature_scores

| | Feature | Score |
|---|---|---|
| 2 | Sex | 92.702447 |
| 1 | Pclass | 30.873699 |
| 4 | Embarked | 10.202525 |
| 3 | Age | 3.300942 |
| 5 | FamilySize | 0.336787 |
| 0 | Fare | 0.109251 |
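To actually reduce the feature set, the fitted selector has to be applied with `transform`, and `get_support` tells which columns were kept. A minimal sketch on hypothetical data (the column names `informative`, `noise` and `constant` are made up for the illustration):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features (chi2 requires values >= 0).
X_demo = pd.DataFrame({
    'informative': [0, 0, 0, 1, 1, 1, 0, 1],
    'noise':       [1, 0, 1, 0, 1, 0, 1, 0],
    'constant':    [1, 1, 1, 1, 1, 1, 1, 1],
})
y_demo = pd.Series([0, 0, 0, 1, 1, 1, 0, 1])

selector = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)

# Boolean mask of the selected columns, mapped back to their names.
kept = X_demo.columns[selector.get_support()].tolist()

# transform() returns only the selected columns.
X_selected = selector.transform(X_demo)
print(kept, X_selected.shape)  # ['informative'] (8, 1)
```

In the notebook, the same two calls on `best_features` would yield the five highest-scoring columns, which could then be passed to the models instead of the full `X`.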


### Model building
Now we will build the models. We will use the Perceptron and the Random Forest Classifier, both with their default hyperparameters.

#### Perceptron

In [525]:
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

perceptron = Perceptron()
perceptron.fit(X_train, y_train)
y_pred = perceptron.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.7877094972067039

#### Random Forest Classifier

In [526]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.8435754189944135
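A single train/test split gives a noisy accuracy estimate. One common way to get a steadier comparison is k-fold cross-validation. The sketch below runs on synthetic stand-in data from `make_classification`; in the notebook, the preprocessed `X` and `y` would be used instead. The seeds and `cv=5` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with the preprocessed X and y from above.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

results = {}
for name, model in [
    ('Perceptron', Perceptron(random_state=0)),
    ('Random Forest', RandomForestClassifier(random_state=0)),
]:
    # Five accuracy scores, one per held-out fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    results[name] = scores
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```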

### Model evaluation
Now we will evaluate the models using the confusion matrix and the classification report.

In [527]:
from sklearn.metrics import confusion_matrix, classification_report

y_pred = perceptron.predict(X_test)
print('Perceptron')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

y_pred = random_forest.predict(X_test)
print('Random Forest Classifier')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Perceptron
[[92 22]
 [16 49]]
              precision    recall  f1-score   support

           0       0.85      0.81      0.83       114
           1       0.69      0.75      0.72        65

    accuracy                           0.79       179
   macro avg       0.77      0.78      0.77       179
weighted avg       0.79      0.79      0.79       179

Random Forest Classifier
[[105   9]
 [ 19  46]]
              precision    recall  f1-score   support

           0       0.85      0.92      0.88       114
           1       0.84      0.71      0.77        65

    accuracy                           0.84       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179
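The numbers in the classification report follow directly from the confusion matrix. As a worked example, the class `1` (survived) row of the Perceptron report can be recomputed by hand from the matrix printed above:

```python
# Perceptron confusion matrix from the output above:
# rows = true class, columns = predicted class.
tn, fp = 92, 22
fn, tp = 16, 49

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted survivors who did survive
recall = tp / (tp + fn)      # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.79 0.69 0.75 0.72
```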



### Conclusion
On this single random 80/20 split, the Random Forest Classifier performed better than the Perceptron: its accuracy was 0.84, compared with 0.79 for the Perceptron. Its macro- and weighted-average precision, recall and F1-score were also higher, although the Perceptron reached a slightly higher recall on the survivors class (0.75 vs. 0.71). This suggests that the Random Forest Classifier is the better of the two at predicting whether a passenger survived the Titanic disaster. Because the split is not seeded, the exact numbers will vary between runs.

---

### Additional scripts
#### Searching for best age bins

In [528]:
# from sklearn.ensemble import RandomForestClassifier
# import numpy as np
# 
# max_accuracy = 0
# max_bins = []
# for d in range(1, 100000):
#     (i, j, g, h, m) = np.random.randint(6, 75, 5)
#     bins = sorted([0, i, j, h, g, m, 100])
#     copy = X.copy()
# 
# 
#     def age_category(age):
#         if age <= bins[1]:
#             return 0
#         elif age <= bins[2]:
#             return 1
#         elif age <= bins[3]:
#             return 2
#         elif age <= bins[4]:
#             return 3
#         elif age <= bins[5]:
#             return 4
#         elif age <= bins[6]:
#             return 5
#         else:
#             return 6
# 
# 
#     copy['Age'] = copy['Age'].apply(age_category)
#     total_accuracy = 0
#     for _ in range(5):
#         X_train, X_test, y_train, y_test = train_test_split(copy, y, test_size=0.2)
#         random_forest = RandomForestClassifier()
#         random_forest.fit(X_train, y_train)
#         y_pred = random_forest.predict(X_test)
#         accuracy = accuracy_score(y_test, y_pred)
#         total_accuracy += accuracy
#     total_accuracy /= 5
#     if total_accuracy > max_accuracy:
#         max_accuracy = total_accuracy
#         max_bins = bins
#         print(max_accuracy, max_bins, d)
#     if d % 100 == 0:
#         print(max_accuracy, max_bins, d)