# Exercise 1

## Performing a classification task on the Titanic dataset using the RandomForest method without cross-validation.

a) Download the Titanic dataset from the supplied URL: ``` https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv ```

b) This is a binary classification task; the labels are stored in the “Survived” column.

c) As the classification method you will be using RandomForest.
See here: ``` https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html ```

d) Use all the supplied features: Pclass (categorical), Name, Sex (categorical), Age, Siblings/Spouses Aboard, Parents/Children Aboard, and Fare.

e) Initialize RandomForest with 10 trees.
``` from sklearn.ensemble import RandomForestClassifier ```

f) Train the classifier on training data consisting of 80% of the whole dataset; the remaining 20% is the test set. This example uses neither a validation set nor cross-validation.

g) Drop unnecessary columns. In this example we drop only “Name”, even though other columns may also be irrelevant.

h) Apply one-hot encoding for the categorical variables.

i) Drop examples with NA values. ``` dropna() ```

j) Train the classifier and make predictions on the test data.

k) Calculate the accuracy of the model on the test data.
``` from sklearn.metrics import accuracy_score ```
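
The one-hot encoding and NA-dropping steps, h) and i), can be sketched on a toy table. The rows below are hypothetical and only mimic a few of the Titanic columns; `toy`, `encoded`, and `cleaned` are illustrative names, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (not the real data).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# One-hot encode the categorical columns. drop_first=True omits one
# indicator per variable, since it is implied by the remaining ones.
encoded = pd.get_dummies(toy, columns=["Sex", "Pclass"], drop_first=True)

# Drop every row that contains a missing value (here, the row without Age).
cleaned = encoded.dropna()

print(list(cleaned.columns))
print(len(toy), "rows before,", len(cleaned), "rows after dropna()")
```

With `drop_first=True`, a two-level variable such as Sex becomes a single indicator column, which avoids carrying redundant features into the classifier.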


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)


In [None]:
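# Drop the "Name" column, which is not used as a feature.
data = data.drop(['Name'], axis=1)
# One-hot encode the categorical variables; drop_first=True omits one redundant indicator per variable.
data = pd.get_dummies(data, columns=['Sex', 'Pclass'], drop_first=True)
# Drop rows that contain missing values.
data = data.dropna()
# Separate the features from the "Survived" labels.
X = data.drop('Survived', axis=1)
y = data['Survived']

# Hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random forest with 10 trees; random_state makes the run reproducible.
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set and report the accuracy.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")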

Accuracy: 0.7528089887640449
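
The accuracy printed above is the fraction of test rows whose predicted label matches the true label. A minimal sketch of that definition, using hypothetical labels chosen only for illustration (`y_true`, `y_hat`, `manual`, and `library` are not names from the exercise):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, chosen only for illustration.
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Accuracy computed by hand: the share of positions where the labels agree.
manual = (y_true == y_hat).mean()

# The same quantity computed by scikit-learn.
library = accuracy_score(y_true, y_hat)

print(manual, library)
```

Because accuracy weights every test row equally, it can look favorable on imbalanced classes even when the minority class is predicted poorly.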


# Exercise 2

Repeat the first exercise, this time applying feature selection.

Refer to ``` https://scikit-learn.org/stable/modules/feature_selection.html ``` for feature selection.

Print the names of the top **three** selected features (```n_features_to_select=3```). To do so, use ```get_support()```.
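
As a rough illustration of how ```get_support()``` maps back to column names, the sketch below runs `SequentialFeatureSelector` on synthetic data. The column names `f0` to `f4` and the variables `X_demo`, `y_demo`, `selector`, `mask`, and `selected` are hypothetical. Which three features get picked depends on the data and the estimator, so no particular result is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic data with hypothetical column names (not the Titanic data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(120, 5)),
    columns=["f0", "f1", "f2", "f3", "f4"],
)
# The label depends only on f0 and f3; the other columns are noise.
y_demo = (X_demo["f0"] + X_demo["f3"] > 0).astype(int)

# Forward sequential selection of three features.
selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=10, random_state=0),
    n_features_to_select=3,
)
selector.fit(X_demo, y_demo)

# get_support() returns a boolean mask with one entry per input column.
mask = selector.get_support()

# Indexing the column Index with the mask yields the selected names.
selected = X_demo.columns[mask]

print(mask)
print(list(selected))
```

The mask has the same length as the number of input columns, which is why it can index `X_demo.columns` directly.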

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import accuracy_score


url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)



In [None]:
data = data.drop(['Name'], axis=1)
data = pd.get_dummies(data, columns=['Sex', 'Pclass'], drop_first=True)
data = data.dropna()
X = data.drop('Survived', axis=1)
y = data['Survived']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

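# SequentialFeatureSelector clones and fits the estimator internally,
# so the classifier does not need to be fitted beforehand.
clf = RandomForestClassifier(n_estimators=10, random_state=42)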

# Forward sequential selection of three features (the selector cross-validates internally).
sfs = SequentialFeatureSelector(clf, n_features_to_select=3)
sfs.fit(X_train, y_train)
X_train_selected = sfs.transform(X_train)
X_test_selected = sfs.transform(X_test)

# Get the names of the selected features from the boolean mask.
selected_columns = X.columns[sfs.get_support()]

# Retrain on the selected features only and evaluate on the test set.
clf.fit(X_train_selected, y_train)
y_pred = clf.predict(X_test_selected)

accuracy = accuracy_score(y_test, y_pred)
print(f"Selected columns: {selected_columns}")
print(f"Accuracy: {accuracy}")


Selected columns: Index(['Siblings/Spouses Aboard', 'Sex_male', 'Pclass_3'], dtype='object')
Accuracy: 0.7359550561797753
