<p style="text-align:center;">
<img src="https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/datascience-notebook-header.png?raw=true"
     alt="DigitalFuturesLogo"
     style="float: center; margin-right: 10px;" />
</p>

## Digital Futures Data Programme
### Logistic Regression
#### V2

In [None]:
## Imports
## Importing the big 4 - pandas, NumPy, seaborn & matplotlib
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
## Import the metrics we'll be using
from sklearn import metrics
## Import Logistic Regression from sklearn
from sklearn.linear_model import LogisticRegression

## 1. EDA

In [None]:
## Read the dataframe


In [None]:
## Explore the first couple of entries


In [None]:
## Explore the last couple of entries


In [None]:
## What's the dimension of our data?


In [None]:
## What columns are we using?


<img src="https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/Titanic.png?raw=True"/>

In [None]:
## What are the nulls & data types


#### Data cleaning & prep

In [None]:
## How many passengers survived vs. perished?


In [None]:
## Check the unique values for sex -- all other fields are either not categories (i.e take a lot of different values)
## or their unique values are already explained in the data dictionary (survived, ticket class, embarked)



In [None]:
## To work with logistic regression, we will need our values to be numerical. There are a couple of columns that
## concern us. The first (and easiest to quick-fix) is sex, which is binary in this data.
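
As a rough sketch of that quick fix, the snippet below maps the two categories to 0 and 1 on a few made-up rows (not the real training data). It uses the same male = 0, female = 1 encoding that appears later in this notebook.

```python
import pandas as pd

# Made-up rows standing in for the Titanic training data.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map the two categories to 0/1; any value missing from the dict would become NaN.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Checking the unique values confirms only 0 and 1 remain.
print(df["Sex"].unique())
```

If `.unique()` shows a `nan` after mapping, some value in the column was not covered by the dictionary.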



In [None]:
## Let's check the unique values again, to see if it worked



In [None]:
## We can now start thinking about the other columns, such as age. This is certainly a continuous range, so
## we could investigate the distribution



> What are your observations?

In [None]:
## Another way to view distributions, but for categories is to take the value counts. Remember what we did for survived?
## That quickly becomes quite silly for 5 categories.. or 8, or 10. So here's a much better way using the value_counts() method
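
A minimal sketch of `value_counts()`, assuming the column of interest is `SibSp` (the cell above does not name it) and using made-up numbers:

```python
import pandas as pd

# Made-up sibling/spouse counts; the column name 'SibSp' is an assumption here.
df = pd.DataFrame({"SibSp": [0, 1, 0, 2, 1, 0, 0, 4]})

# value_counts() returns one row per distinct value, most frequent first.
counts = df["SibSp"].value_counts()
print(counts)

# normalize=True gives proportions instead of raw counts.
shares = df["SibSp"].value_counts(normalize=True)
print(shares)
```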



In [None]:
## We can do the same for parch



In [None]:
## For these values, we can visualise them most easily using a barplot



> If we had (much!) fewer observations, we could also visualise them using a dotplot

In [None]:
## Typically, we won't be interested in looking at just the distribution of a single feature.
## It's more interesting to inspect how it compares against other features in our data, and look for patterns or correlations
## The easiest visual tool to achieve that is our trusty pairplot



In [None]:
## Even more interesting, what if we want a broad, holistic view of our data? This is where the correlation
## matrix comes in - providing the (default: Pearson) correlation between each pair of features
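
Here is one way the correlation matrix could be computed, shown on a tiny made-up dataframe. Selecting the numeric columns first is a choice made for this sketch, so that text columns such as `Name` don't get in the way.

```python
import pandas as pd

# Made-up rows; column names follow the Titanic data, values are illustrative only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Sex":      [0, 1, 1, 0, 1, 0],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 53.1, 8.46],
    "Name":     ["a", "b", "c", "d", "e", "f"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations (the default).
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

The result is itself a dataframe, so it can be passed straight to a heatmap if you prefer colours to numbers.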



> What are your observations based on this?

<details><summary> Click me for solutions </summary>
    
* Survived = 1 goes strongly with Sex = 1 (female): women were much more likely to survive :)


* Survived = 0 goes strongly with Sex = 0 (male): men were much more likely to perish :(
    
A higher passenger class number (i.e. a lower class) goes with a lower fare paid (-.42 corr) and a lower chance of surviving (-.29 corr)
</details>

#### Deeper dive - tickets

In [None]:
## Tickets will be tricky, for three main reasons:
## 1. They're likely too many to categorise



In [None]:
## 2. They're probably not numerical (or contain numbers, but not only)



_And 3. How do we actually extract signal from these?.._

One strategy is.. <s>ignorance is bliss</s> -- let's just remove the column!

Another strategy: What if we were to just focus on extracting the letters?

In [None]:
## We notice that some tickets have letters, others don't. The numbers are clearly an identifier (so bear no signal)
## but could the letters indicate anything about the ticket? We could explore this.
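## Let's use (the absolute best tool) REGEX!

import re
def extract_letters(x):
    ## Strip digits and whitespace, keeping everything else; x==x is False only for NaN, so nulls pass through unchanged
    return re.sub(r'[\d\s]', r'', x) if x==x else x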

In [None]:
## Let's test it on an example
## Ticket name: 752382 AB
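
A quick, self-contained check of `extract_letters` on the example ticket. The other two ticket strings are made up for illustration.

```python
import re

def extract_letters(x):
    # x == x is False only for NaN, so missing tickets pass through unchanged.
    return re.sub(r"[\d\s]", "", x) if x == x else x

example = extract_letters("752382 AB")
with_dot = extract_letters("STON/O2. 3101282")
digits_only = extract_letters("113803")

print(repr(example))      # 'AB'
print(repr(with_dot))     # 'STON/O.'
print(repr(digits_only))  # ''
```

Note that only digits and whitespace are removed, so punctuation such as `/` and `.` is carried along, and a purely numeric ticket becomes an empty string.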



In [None]:
## Seems to be working! Let's apply it to the whole column
## Save the results in a new column, let's call it df['Ticket_letters']
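
Applying the function to a whole column might look like the sketch below. The ticket values are made up, with one null included to show that it survives untouched.

```python
import re
import numpy as np
import pandas as pd

def extract_letters(x):
    # Remove digits and whitespace; NaN (x != x) is returned as-is.
    return re.sub(r"[\d\s]", "", x) if x == x else x

# Made-up ticket values, including one missing entry.
df = pd.DataFrame({"Ticket": ["A/5 21171", "PC 17599", "113803", np.nan, "PC 17755"]})

# .apply() calls the function once per row and collects the results.
df["Ticket_letters"] = df["Ticket"].apply(extract_letters)
print(df)

# value_counts() skips nulls by default; pass dropna=False to count them too.
letter_counts = df["Ticket_letters"].value_counts()
print(letter_counts)
```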



In [None]:
## Do .head() to view the new data



In [None]:
## Looks great! But what does the distribution roughly look like?
## Again, we could use .value_counts() 



In [None]:
##### TODO

## Distribution plots (pairplots, histograms)

## Boxplots -- super important to look for outliers!

## Embarked into category -- address the nulls for it separately here (S, C and Q)

## Fix the extract_letters function (we're carrying dots as well) - or do we need to?

## ...

## ...

## 2. Feature Engineering

In [None]:
## Let's first remember what are our potential features
## We can do that using the .columns attribute



In [None]:
## We'll only extract a couple of these to use. Which ones?
## Clearly, the passenger ID is not of any use. Survived is our target - but we'll keep it in for now, and remove it later.
## Name also acts as an ID (and offers no signal).
## Ticket and Cabin fall into their own category: There _may_ be signal to be extracted out of them one way or another,
## but we didn't do so - so for now, we won't use them.



In [None]:
## Let's save the dataframe we'll use -- don't forget to use .copy()



In [None]:
## Let's do .head()


In [None]:
## Uh-oh.. I can see some null values. Do we remember how to check the nulls? There are many ways


In [None]:
## Let's do the simplest thing: drop the nulls - since they come in negligible amounts


## We can also create y now, the target


In [None]:
## This is why we kept 'Survived' in until now - since we were going to drop nulls. If we dropped the nulls and y was separate,
## we would have a dimension mismatch. Still, let's make sure that's not the case.
## The length of y (# of targets) should match the # of rows in the data (# of observations)
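
The order of operations described above (drop nulls first, then split off the target) can be sketched on a few made-up rows. The name `df_model` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows with the target still attached and one missing Age.
df_model = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age":      [22.0, 38.0, np.nan, 35.0, 27.0],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 11.13],
})

# Drop rows with any null while features and target are still in one dataframe.
df_model = df_model.dropna()

# Only now split off the target, so it has exactly the same rows as the features.
y = df_model["Survived"]
assert len(y) == len(df_model)

# Finally remove the target from the features.
X = df_model.drop(columns="Survived")
print(X.shape, y.shape)
```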


In [None]:
## Fantastic! But let's not forget to drop the target now, so we only keep our features


In [None]:
## There is one final small problem.. we mentioned this a couple of times: our model won't work until we have numerical data only
## Using the .info() method or the .dtypes attribute we can see Embarked is still a category on its own. 
## We'll use OHE to fix this quickly
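
A small sketch of one-hot encoding with `pd.get_dummies`, using made-up rows and the same arguments as the `cleaning_prep` function later in this notebook:

```python
import pandas as pd

# Made-up rows; 'Embarked' takes the values S, C and Q as in the data dictionary.
X = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", "Q", "S"],
})

# One column per category; drop_first=True removes the first one (C) to avoid redundancy.
X = pd.get_dummies(data=X, columns=["Embarked"], prefix="Emb", drop_first=True)
print(X.columns.tolist())
print(X)
```

A row with both `Emb_Q` and `Emb_S` equal to 0 is therefore read as port C.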


In [None]:
## Finally, let's view our data


In [None]:
## Are we sure there are no nulls left? No non-numerical observations? Final check!


In [None]:
##### TODO

## All feature engineering should go in a single, easily accessible & reproducible function

## Null handling - we just discarded them: was this the best way to handle them in this case?

## Definitely have a look at the data distribution - scaling! (StandardScaler, MinMaxScaler, or a log transform)

## Better data extraction -- feature selection

## Model selection (more advanced)

## 3. Logistic Regression

In [None]:
## Like all good things in Python, we only need 1 line ~ Guido Van Rossum
## We just need to create a LogisticRegression() object. All parameters have set defaults, so no need to do anything else,
## but we can always improve the model by considering the parameters


In [None]:
## So, we have an empty LogisticRegression() object. This needs to be fit on our data first and foremost
## Since we're using sklearn, the rule of thumb is: first parameter = features, second parameter = target
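
Creating and fitting the model could look roughly like this. The features and targets are made up and already numeric, and `lr` is the same variable name used further down in this notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up, already-numeric features and targets (not the real Titanic data).
X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 3, 1],
    "Sex":    [0, 1, 1, 1, 0, 0, 0, 1],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 1], name="Survived")

# All parameters are left at their defaults.
lr = LogisticRegression()

# First argument = features, second argument = target.
lr.fit(X, y)

print(lr.classes_)               # the class labels the model learned
print(lr.coef_, lr.intercept_)   # one coefficient per feature, plus the intercept
```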


In [None]:
## Now that the model is fit, we can make our prediction. Fortunately, this too is a single line of code!


In [None]:
## What is this?! Why do we get two sets of values for each prediction?
## Ohh that's right - one is the probability of outcome 0 (perished), the other the probability of outcome 1 (survived)
## Since these are the only 2 options, it adds up (quite literally! They add up to 1)

## Let's store them in 2 columns then, called 'prob_perish' and 'prob_surv'
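
To see the two probability columns in action, here is a sketch on made-up data. It builds a separate `results` dataframe (a hypothetical name) rather than adding columns to the features, which is just one way to store them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up, already-numeric features and targets (not the real Titanic data).
X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 3, 1],
    "Sex":    [0, 1, 1, 1, 0, 0, 0, 1],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 1], name="Survived")

lr = LogisticRegression().fit(X, y)

# predict_proba returns one column per class, ordered as in lr.classes_ (here 0, then 1).
proba = lr.predict_proba(X)
results = pd.DataFrame(proba, columns=["prob_perish", "prob_surv"], index=X.index)

print(results.head())

# Each row's two probabilities add up to 1.
print(np.allclose(results.sum(axis=1), 1.0))
```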


In [None]:
## .head() to check our progress


In [None]:
## Fantastic! However.. we still need to make a binary prediction: Has this passenger survived or not?
## This will be based on the probabilities offered. If prob_surv > .5, we can say they survived.
## Why .5? This is what we call a CUT-OFF POINT - and it's yet another parameter we can pick! We'll try .5, but it might indeed
## not be the optimum value. Maybe we need to use .6, maybe .65: You should explore this on your own
## We'll store our prediction in a column called 'y_pred' and use the np.where() function to do so
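
A minimal sketch of the cut-off step with `np.where`, on made-up probabilities. The second column, `y_pred_065`, is a hypothetical name added only to show how a different cut-off changes the outcome.

```python
import numpy as np
import pandas as pd

# Made-up survival probabilities.
results = pd.DataFrame({"prob_surv": [0.12, 0.50, 0.51, 0.93]})

# np.where(condition, value_if_true, value_if_false), applied row by row.
results["y_pred"] = np.where(results["prob_surv"] > 0.5, 1, 0)

# The same idea with a stricter cut-off of .65.
results["y_pred_065"] = np.where(results["prob_surv"] > 0.65, 1, 0)

print(results)
```

Because the comparison is strictly greater than, a probability of exactly .5 is predicted as 0.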



In [None]:
## Finally, now we should be ready! Let's do a final .head()


> Looks right, at least in theory. The question on everyone's lips now should be: How well did we do?

## 4. Evaluate performance

In [None]:
## We'll use our trusty confusion matrix to answer the question above. Let's first have a look at it on its own


In [None]:
## There are 4 main metrics we're interested in at this stage: accuracy, precision, recall and F1.
## The 'metrics' module from sklearn covers all of them (and more!) So we can use a function like the one below

def get_results(actual, predicted):
    print("The confusion matrix for your predictions is:")
    print(metrics.confusion_matrix(actual, predicted), "\n")
    print(f'The accuracy of your model is: {metrics.accuracy_score(actual, predicted)}')
    print(f'The recall of your model is: {metrics.recall_score(actual, predicted)}')
    print(f'The precision of your model is: {metrics.precision_score(actual, predicted)}')
    print(f'The F1-score of your model is: {metrics.f1_score(actual, predicted)}')

In [None]:
## Now, we simply apply the function on our predictions


> What do you notice at this stage?

In [None]:
## Not bad! Can certainly do better, but not bad. The function is quite long-winded though
## It's good that we can take all metrics separately, but surely there must be an easier way.
## The classification report provides just that! Don't forget to add print() otherwise it will be hard to look at :p
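
For reference, here is what the classification report call could look like on a handful of made-up labels. The confusion matrix and accuracy are printed alongside so the numbers can be cross-checked.

```python
from sklearn import metrics

# Made-up actual and predicted labels (1 = survived, 0 = perished).
actual    = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(actual, predicted)
print(cm)

accuracy = metrics.accuracy_score(actual, predicted)
print(accuracy)

# Precision, recall and F1 per class in one call; print() keeps the table readable.
print(metrics.classification_report(actual, predicted))
```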



In [None]:
## Much better! I really like heatmaps and pretty colours though
## Luckily, 'metrics' has us covered once again, using the ConfusionMatrixDisplay tool



In [None]:
#### TODO

## Explore the cutoff point

## Explore feature selection, standardization -- everything that went into Feature Engineering

## Explore more metrics! ROC/AUC -- Maximizing the area under the curve (AUC): AUC > .8 usually really good

## Predicted values vs. actual values distribution - what would a good distribution look like?

## Play with hyperparameters for the logistic regression

## ...

## ...
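
As a starting point for the cut-off and ROC/AUC items in the list above, the sketch below tries a few cut-offs and computes the AUC. The labels and probabilities are made up, and accuracy is just one possible metric to compare cut-offs with.

```python
import numpy as np
from sklearn import metrics

# Made-up actual labels and predicted survival probabilities.
actual    = np.array([0, 0, 1, 1, 1, 0, 1, 0])
prob_surv = np.array([0.10, 0.55, 0.80, 0.62, 0.45, 0.30, 0.90, 0.20])

# AUC is computed from the probabilities directly, so it doesn't depend on any cut-off.
auc = metrics.roc_auc_score(actual, prob_surv)
print("AUC:", auc)

# Accuracy, on the other hand, changes with the cut-off we pick.
accuracies = {}
for cutoff in (0.4, 0.5, 0.6, 0.7):
    predicted = np.where(prob_surv > cutoff, 1, 0)
    accuracies[cutoff] = metrics.accuracy_score(actual, predicted)
    print(cutoff, accuracies[cutoff])
```

Picking the cut-off on the same data the model was trained on can be optimistic, so it is worth checking on held-out data as well.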

## Your Turn!

Can you do better? Start by following the advice given above for each step, and explore on your own.

Feeling like you nailed it? Or just want to check yourself? Try running it on the test dataset! You can find it [here](https://noodle.digitalfutures.com/course/view.php?id=81). Wait! This data.. doesn't have the Survived column? That's right - because this is actually part of a [Kaggle competition](https://www.kaggle.com/datasets). You too can participate, and you can submit as many solutions as you want! Once you've improved performance on your training data, apply the same steps to the test data, generate your predictions and [submit them to see how you did](https://www.kaggle.com/competitions/tabular-playground-series-apr-2021/data). The competition has ended, but you can always make a 'late submission' just to test yourself :)

Why don't we start with our base model as an example:

In [None]:
df = pd.read_csv('test.csv')
df.head()

In [None]:
## We'll do the cleaning properly all in 1 go, since we now know what to do
def cleaning_prep(df):
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] ## select features
    df = df[features].copy() ## only use selected features; .copy() leaves the original dataframe untouched
    df['Sex'] = df['Sex'].map({'male':0, 'female':1}) ## Change sex to numbers
    df = pd.get_dummies(data = df, columns = ['Embarked'], prefix='Emb', drop_first=True) ## OHE Embarked
    df = df.fillna(0) ## We cannot drop the nulls, since we need the whole data to submit
    return df
df_test = cleaning_prep(df)

In [None]:
## Let's apply the model and get our predictions: Remember, we use the same parameters as we did in the training case
## IMPORTANT: We already have the logistic regression fit on the training set, we now only need to predict
df_test[['prob_perish', 'prob_surv']] = lr.predict_proba(df_test)
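
One thing worth checking before predicting: the model expects the test features to have the same columns, in the same order, as the ones it was fit on. A hedged sketch of how that could be enforced with `reindex` is below; `train_cols` and `X_test` are hypothetical names and the rows are made up.

```python
import pandas as pd

# Hypothetical list of the columns the model was trained on, in training order.
train_cols = ["Pclass", "Sex", "Emb_Q", "Emb_S"]

# Made-up test features: different column order, and 'Emb_Q' is missing entirely.
X_test = pd.DataFrame({"Sex": [1, 0], "Pclass": [3, 1], "Emb_S": [1, 0]})

# reindex puts the columns in training order and fills any missing ones with 0.
X_test = X_test.reindex(columns=train_cols, fill_value=0)
print(X_test)
```

A dummy column can go missing like this when a category simply doesn't occur in the test data.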

In [None]:
## Now we get our predictions:
df_test['Survived'] = np.where(df_test['prob_surv']>.5, 1, 0)
## We also need to grab the PassengerId to match: no rows were dropped, so the indexes line up
df_test['PassengerId'] = df['PassengerId']

In [None]:
## Save them as a csv file, to then be uploaded -- note: the submission only expects the ID & Survived column
df_test[['PassengerId', 'Survived']].to_csv("Titanic_pred.csv", index=False)

I now take my file and upload it here:
<img src="https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/Titanic_upload1.png?raw=True"/>

And this is how we did:
<img src="https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/Titanic_upload2.png?raw=True"/>

Now it's your turn! I'm sure you can do much better. Good luck :)