# Titanic Dataset
**Description**:
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

Data Exploration

**Variable Definition Key**
| Variable | Description                               | Values                                      |
|----------|-------------------------------------------|---------------------------------------------|
| survival | Survival                                  | 0 = No, 1 = Yes                             |
| pclass   | Ticket class                              | 1 = 1st, 2 = 2nd, 3 = 3rd                   |
| sex      | Sex                                       |                                             |
| age      | Age in years                              |                                             |
| sibsp    | Number of siblings / spouses aboard      |                                             |
| parch    | Number of parents / children aboard      |                                             |
| ticket   | Ticket number                             |                                             |
| fare     | Passenger fare                            |                                             |
| cabin    | Cabin number                              |                                             |
| embarked | Port of Embarkation                       | C = Cherbourg, Q = Queenstown, S = Southampton |



In [None]:
# Read the data
titanic_data = pd.read_csv('Titanic-Dataset.csv')

In [None]:
#showing data
titanic_data.head()

In [None]:
#Show columns
titanic_data.columns

In [None]:
#information about the dataset 
titanic_data.info()

In [None]:
# Show summary statistics for the numeric columns
titanic_data.describe()

Data Cleaning

In [None]:
# Check for null values in every column
titanic_data.isna().sum()

As we can see, three columns contain missing values: 'Age', 'Cabin', and 'Embarked'. Let's choose a suitable strategy for each column.
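One way to inform that choice is to look at the share of missing values per column rather than the raw counts. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real `titanic_data`); the same expression should work on the real DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for titanic_data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isna() gives a boolean frame; its column-wise mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty is usually a candidate for dropping, while a column with only a few gaps can be filled with a summary value such as the mean or the most frequent item.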

In [None]:
# For the 'Age' column, fill missing values with the column mean
age_mean = titanic_data['Age'].mean()
titanic_data['Age'] = titanic_data['Age'].fillna(age_mean)
titanic_data['Age'].isna().sum()

In [None]:
# Drop the 'Cabin' column because it doesn't help with our task
titanic_data.drop(columns=['Cabin'], inplace=True)

In [None]:
# For the 'Embarked' column, fill missing values with the most frequent value
most_frequent_item = titanic_data['Embarked'].value_counts().idxmax()
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(most_frequent_item)
titanic_data['Embarked'].isna().sum()

In [None]:
#Recheck null values in the dataset
titanic_data.isna().sum()

In [None]:
#Let's encode the 'Sex' and 'Embarked' columns
label_encoder = LabelEncoder()
titanic_data['Sex'] = label_encoder.fit_transform(titanic_data['Sex'])
titanic_data['Embarked'] = label_encoder.fit_transform(titanic_data['Embarked'])
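Because the same `LabelEncoder` is fitted twice above, it only remembers the classes of the last column it saw. The sketch below, on made-up values, uses one encoder per column (the names `sex_encoder` and `embarked_encoder` are illustrative, not from the notebook) and prints the label-to-integer mapping, which is handy when reading the plots and predictions later.

```python
from sklearn.preprocessing import LabelEncoder

# One encoder per column, so each fitted mapping stays available afterwards.
sex_encoder = LabelEncoder()
sex_codes = sex_encoder.fit_transform(["male", "female", "female", "male"])

embarked_encoder = LabelEncoder()
embarked_codes = embarked_encoder.fit_transform(["S", "C", "Q", "S"])

# classes_ is sorted; the label at position i in classes_ is encoded as i.
sex_mapping = {str(label): code for code, label in enumerate(sex_encoder.classes_)}
embarked_mapping = {str(label): code for code, label in enumerate(embarked_encoder.classes_)}
print(sex_mapping)
print(embarked_mapping)
```

Note that label encoding gives 'Embarked' an artificial order (C < Q < S); one-hot encoding is a common alternative when that order is not meaningful.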

In [None]:
# Drop unnecessary columns
titanic_data.drop(columns=['Ticket', 'PassengerId', 'Name'], inplace=True)

In [None]:
titanic_data.head()

Data Visualization

In [None]:
# Explore the distribution of survivors by age
sns.histplot(x='Age', hue='Survived', kde=True, data=titanic_data)
plt.title('Distribution of survivors by age')
plt.legend(['Not survived', 'Survived'])
plt.show()

In [None]:
# Explore the distribution of survivors by gender
sns.countplot(x='Survived', hue='Sex', data=titanic_data)
plt.title("Survival count by gender")
plt.legend(['Female', 'Male'])
plt.xticks((0, 1), labels=['Not survived', 'Survived'])
plt.show()

In [None]:
# Explore the distribution of survivors by passenger class
sns.countplot(x='Survived', hue='Pclass', data=titanic_data)
plt.title("Survival count by passenger class")
plt.xticks((0, 1), labels=['Not survived', 'Survived'])
plt.show()
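Count plots can be hard to compare when the groups differ in size, so a survival rate per group may be a useful complement. The sketch below uses a small made-up frame (`toy` is hypothetical); on the real data the same `groupby` expression should apply to `titanic_data`.

```python
import pandas as pd

# Made-up rows for illustration only; not the real passenger data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per class.
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```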

In [None]:
# Explore the distribution of fares among passengers
sns.histplot(x='Fare', hue='Survived', kde=True, data=titanic_data)
plt.title('Distribution of survivors by fare')
plt.legend(['Not survived', 'Survived'])
plt.show()

In [None]:
# Split the data into training and test sets
X = titanic_data.drop(columns=['Survived'])
y = titanic_data['Survived']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a support vector classifier with default settings and predict on the test set
model = SVC()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
print(f"Prediction: {y_pred}")

print('_______________________________________________________________________________')
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Confusion matrix:\n{conf_matrix}")
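The scaler should learn its mean and standard deviation from the training set only; refitting it on the test set would scale the two sets differently. One way to make this hard to get wrong is to wrap the scaler and the classifier in a scikit-learn pipeline. The sketch below is a minimal example on synthetic data from `make_classification` (a stand-in, not the Titanic features); the names `X_demo`, `y_demo`, and `pipeline` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# fit() fits the scaler on the training data only, then fits the SVC on the scaled data.
pipeline = make_pipeline(StandardScaler(), SVC())
pipeline.fit(X_tr, y_tr)

# score() scales the test data with the training statistics before predicting.
demo_accuracy = pipeline.score(X_te, y_te)
print(f"Accuracy on synthetic data: {demo_accuracy:.3f}")
```

The same pipeline object can also be passed to cross-validation utilities, which then refit the scaler inside each fold.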



In [None]:
prediction = pd.DataFrame(y_pred,columns=['Survived'])
prediction['Survived'] = prediction['Survived'].map({0: 'Did not survive', 1: 'Survived'})
prediction
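A DataFrame built from the prediction array alone gets a fresh 0..n-1 index, so its rows no longer line up with the rows of the test set. If the predictions need to be compared with the true labels row by row, one option is to reuse the index of the test labels. The sketch below uses hypothetical stand-ins (`y_test_demo`, `y_pred_demo`) with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: the index mimics the shuffled row labels a split produces.
y_test_demo = pd.Series([0, 1, 1, 0], index=[709, 439, 840, 720], name="Survived")
y_pred_demo = np.array([0, 1, 0, 0])

# Reuse the test index so each prediction sits next to its true label.
comparison = pd.DataFrame(
    {"Actual": y_test_demo, "Predicted": y_pred_demo},
    index=y_test_demo.index,
)
comparison["Correct"] = comparison["Actual"] == comparison["Predicted"]
print(comparison)
```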