#### https://www.kaggle.com/competitions/titanic/overview
Overview of the project


# Import Statements

In [10]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Formatting floats to 2 decimal places

In [11]:
pd.options.display.float_format = '{:,.2f}'.format

In [12]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

### Column names and meaning

---------------------------

**Characteristics:**  

    :Number of Instances: 891 

    :Number of Attributes: 12 columns (numeric and categorical). The Survived value (attribute 2) is the target; the rest are candidate predictors.

    :Attribute Information (in order):
        1. PassengerId     Id of Passenger
        2. Survived     Survival  0 = No, 1 = Yes
        3. Pclass    Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd. A proxy for socio-economic status (SES):
                1st = Upper
                2nd = Middle
                3rd = Lower
        4. Name     Name of passenger
        5. Sex      Sex
        6. Age       Age in years. Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
        7. SibSp      # of siblings / spouses aboard the Titanic	
            The dataset defines family relations in this way...
            Sibling = brother, sister, stepbrother, stepsister
            Spouse = husband, wife (mistresses and fiancés were ignored)
        8. Parch      # of parents / children aboard the Titanic	
                    The dataset defines family relations in this way...
                    Parent = mother, father
                    Child = daughter, son, stepdaughter, stepson
                    Some children travelled only with a nanny, therefore parch=0 for them.
        9. Ticket      Ticket number	
        10. Fare      Passenger fare
        11. Cabin   Cabin number
        12. Embarked   Port of Embarkation - C = Cherbourg, Q = Queenstown, S = Southampton
    

# Preliminary Data exploration

In [18]:
df_train.shape
# 891 rows, 12 columns
# df_train.columns
# PassengerId is only a row identifier, so drop it before exploring
df_train.drop("PassengerId", axis=1, inplace=True)

In [23]:
#df_train.head()
#Cabin has 687 NaN values, embarked has 2 and age has 177
df_train.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

Let's figure out how to deal with the NaN values and think of questions I want to answer.
I want to get to know the data and how it looks before I try to use a Linear Regression to make predictions.
Then I will try other models that I haven't used before, like Logistic Regression and Random Forest, which I have come across.


## General steps to follow

1. Acquire and explore the data: Download the train and test datasets from Kaggle and perform exploratory data analysis (EDA) to understand the distribution of the variables and identify any patterns or missing data.

2. Clean and preprocess the data: Handle missing values and outliers, and convert categorical variables into numerical variables so that they can be used in the model.

3. Feature Engineering: Create new features by combining existing variables or by extracting useful information from the variables.

4. Select and train a model: Select a suitable machine learning algorithm, such as Random Forest, SVM, Logistic Regression, etc. and train it on the preprocessed data.

5. Evaluate the model: Use techniques such as cross-validation to evaluate the performance of the model.

6. Make predictions on the test dataset and submit the result to Kaggle.
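
Steps 4 and 5 could look roughly like the sketch below. It is only an illustration: the rows are randomly generated placeholders (not Titanic data), so the scores mean nothing, and the helper name `cross_validate_baseline` and the chosen feature columns are my own assumptions rather than part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder rows (NOT Titanic data) so the sketch runs on its own.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 70, size=n),
    "Fare": rng.uniform(5, 100, size=n),
    "Survived": rng.integers(0, 2, size=n),
})


def cross_validate_baseline(df, n_splits=5):
    """Return the per-fold accuracy of a logistic regression baseline.

    Hypothetical helper: the feature list is an assumption for illustration.
    """
    # One-hot encode Sex so every feature is numeric.
    X = pd.get_dummies(df[["Pclass", "Sex", "Age", "Fare"]],
                       columns=["Sex"], dtype=float)
    y = df["Survived"]
    model = LogisticRegression(max_iter=1000)
    # cross_val_score refits the model on each training split and
    # scores it on the held-out fold.
    return cross_val_score(model, X, y, cv=n_splits, scoring="accuracy")


scores = cross_validate_baseline(demo)
print(scores.mean())
```

On the real data the same function could be called with `df_train` once its missing values have been handled, since `LogisticRegression` does not accept NaN inputs.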

# Going in depth on the first steps

Acquiring and exploring the data: This step involves downloading the train and test datasets from Kaggle and using tools such as pandas and matplotlib to perform exploratory data analysis (EDA) to understand the distribution of the variables and identify any patterns or missing data.
For example, you can use the pandas library to load the train and test datasets into dataframes and use the .head() method to view the first few rows of the data. You can use the .info() method to get a summary of the dataframe, including the number of non-null values in each column, and the .describe() method to get a summary of the numerical variables.
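
A minimal sketch of those three calls, using a few made-up rows with the Titanic column names so it runs without `train.csv` (the values are placeholders, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook this would be
# df_train = pd.read_csv('train.csv').
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

print(sample.head(3))        # first three rows
sample.info()                # dtypes and non-null counts per column
summary = sample.describe()  # count, mean, std, quartiles of numeric columns
missing = sample.isna().sum()  # number of NaN values per column
print(summary)
print(missing)
```

Note that `.describe()` skips non-numeric columns such as `Sex` by default; passing `include="all"` would add them.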

You can also use the matplotlib library to create visualizations such as histograms, bar plots, and scatter plots to help you understand the distribution of the variables and identify patterns or outliers in the data.
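
As a rough illustration, a histogram of ages could be drawn like this. The ages are made-up placeholder values and the bin edges are an arbitrary choice; the `matplotlib.use("Agg")` line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be omitted in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder ages, not taken from the dataset.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 2.0, 54.0, 27.0, 14.0, 4.0, 58.0],
                 name="Age")

fig, ax = plt.subplots(figsize=(6, 4))
# dropna() matters on the real column, where Age has missing values.
counts, bin_edges, _ = ax.hist(ages.dropna(), bins=[0, 10, 20, 30, 40, 50, 60])
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of passengers")
ax.set_title("Age distribution")
fig.tight_layout()
```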

Cleaning and preprocessing the data: This step involves handling missing values and outliers, and converting categorical variables into numerical variables so that they can be used in the model.
For example, if you find that a column has missing values, you may choose to fill in the missing values with the mean or median of the column. If you find outliers in the data, you may choose to remove them or replace them with a more reasonable value.
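
One way this could look for the `Age` and `Embarked` columns is sketched below on placeholder rows. Filling `Embarked` with the most frequent value is my own assumption for a categorical column; the text above only mentions the mean or median.

```python
import pandas as pd

# Placeholder rows with the same kind of gaps as the real columns.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Numeric column: fill with the median (NaN values are skipped by default).
age_median = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median)

# Categorical column: fill with the most frequent value (an assumption here).
embarked_mode = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(embarked_mode)

print(df)
```

When the same fill is applied to the test set, it is usually safer to reuse the values computed on the training set rather than recomputing them.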

To convert categorical variables into numerical variables, you can use techniques such as one-hot encoding, which creates a new binary column for each unique category in a categorical variable.
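
A small sketch of one-hot encoding with `pd.get_dummies`, again on placeholder rows rather than the real data:

```python
import pandas as pd

# Placeholder rows using the Titanic column names.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.75],
})

# Each listed column is replaced by one 0/1 column per unique category.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
```

For linear models it can help to pass `drop_first=True`, which drops one category per variable so the new columns are not perfectly correlated.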

Feature Engineering: This step involves creating new features by combining existing variables or by extracting useful information from the variables.
For example, you can create a new feature by combining the "Pclass" and "Fare" columns, such as the fare a passenger paid relative to the typical fare in their class. You can also create new features by extracting useful information from the "Name" column, such as the title of a passenger.
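
Both ideas could be sketched as follows. The new column names `Title` and `FareVsClassMedian` are hypothetical, and dividing by the class median is just one possible way to combine "Pclass" and "Fare". The regular expression assumes names follow the "Surname, Title. Given names" pattern.

```python
import pandas as pd

# A few rows in the "Surname, Title. Given names" format of the Name column.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925],
})

# Title: the text between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Fare relative to the median fare of the passenger's own class.
class_median = df.groupby("Pclass")["Fare"].transform("median")
df["FareVsClassMedian"] = df["Fare"] / class_median

print(df[["Title", "Pclass", "Fare", "FareVsClassMedian"]])
```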

You can also group values by similarity. For example, ages can be grouped into ranges (child, young, adult, senior), which may make patterns easier for the model to pick up.
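
A possible sketch of the age grouping with `pd.cut`. The cut points (12, 25, 60) are assumptions chosen for illustration, not values prescribed by the dataset.

```python
import pandas as pd

# Placeholder ages, including one missing value.
ages = pd.Series([4.0, 15.0, 30.0, 70.0, None], name="Age")

bins = [0, 12, 25, 60, 120]  # assumed cut points
labels = ["child", "young", "adult", "senior"]

# right=False makes each bin include its lower edge: [0, 12), [12, 25), ...
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group)
```

Missing ages stay missing after binning, so they still need to be filled (or given their own category) before training.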

It's important to note that feature engineering is an iterative process, and you may need to go back and repeat steps 2 and 3 multiple times to improve your model's performance.