# Machine Learning Hackathon at FHWS

## Delivery of your solution
1. Clone the repository.
2. Use the competition notebook (notebook.ipynb) to present your final solution. Fill the required cells with your solution.
3. Rename it to your team name or number. 
4. Zip the notebook and dataset.csv together with any additional data or files.
5. Rename the zip file to your team name or number.
6. Send the zip file to christoph.raab@fhws.de. The file must not be larger than 50 MB!
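
A minimal sketch of steps 4 and 5, assuming the notebook has already been renamed: it zips the given files into `<team_name>.zip` and checks the size limit before you send the archive. The helper name `build_submission`, the example team name, and the reading of 50 MB as 50 * 1024 * 1024 bytes are assumptions for illustration, not part of the competition rules.

```python
import zipfile
from pathlib import Path

# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_BYTES = 50 * 1024 * 1024


def build_submission(team_name, files, out_dir="."):
    """Zip the given files into <team_name>.zip and check the size limit."""
    archive = Path(out_dir) / f"{team_name}.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            f = Path(f)
            # Store each file under its bare name so that the notebook
            # finds the data next to itself after unzipping.
            zf.write(f, arcname=f.name)
    size = archive.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"{archive} is {size} bytes, above the 50 MB limit")
    return archive
```

For example, `build_submission("team_07", ["team_07.ipynb", "dataset.csv"])` would create `team_07.zip` in the current directory.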

## Important notes
- Please _ensure_ that the notebook and its files run _standalone_ on a computer with Python > 3.5.
- The last cell of the notebook should show the evaluation of the dataset. This cell loads the evaluation data, which determines the final winner. It should apply your preprocessing steps etc. and output the accuracy of your trained model.
- The model _should_ be of type scipy or keras!
- The cell can be tested by merely renaming the dataset to evaluation.csv.
- *Only this last cell will be evaluated. Hence, make sure it contains everything needed to obtain your best performance.*
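
One way to dry-run the last cell without touching the original file is to write a copy of the training data under the evaluation file name. The sketch below assumes the evaluation file has the same columns as dataset.csv. It samples 2232 rows by default, because the template's shape check for evaluation.csv expects that many and would reject the full dataset. `make_dry_run_file` and its parameters are hypothetical names. Accuracy measured this way is computed on data the model has already seen, so it only verifies that the cell runs.

```python
import pandas as pd


def make_dry_run_file(dataset="dataset.csv", target="evaluation.csv",
                      n_rows=2232, seed=0):
    """Write a random sample of the training data as a stand-in evaluation file."""
    df = pd.read_csv(dataset)
    # Never ask for more rows than the dataset contains.
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed)
    # index=False keeps the column count identical to the source file.
    sample.to_csv(target, index=False)
    return sample.shape
```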

### If there are any problems, contact your scrum master. Make sure that your team follows the instructions above; otherwise your team will be excluded from the competition!

### Please test your notebook on multiple computers to verify the interoperability.
----


# Objective
Find the best strategies to improve the next marketing campaign. How can the financial institution make its future marketing campaigns more effective? To answer this, we analyze the last marketing campaign the bank performed and identify patterns that help us draw conclusions and develop future strategies.

# A. Attributes Description: <br>

Input variables:<br>
# Ai. bank client data:<br>
<a id="bank_client_data"></a>
1 - **age:** (numeric)<br>
2 - **job:** type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')<br>
3 - **marital:** marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)<br>
4 - **education:** (categorical: primary, secondary, tertiary and unknown)<br>
5 - **default:** has credit in default? (categorical: 'no','yes','unknown')<br>
6 - **housing:** has housing loan? (categorical: 'no','yes','unknown')<br>
7 - **loan:** has personal loan? (categorical: 'no','yes','unknown')<br>
8 - **balance:** account balance of the client (numeric)<br>
# Aii. Related with the last contact of the current campaign:
<a id="last_contact"></a>
9 - **contact:** contact communication type (categorical: 'cellular','telephone') <br>
10 - **month:** last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')<br>
11 - **day:** last contact day of the week (categorical: 'mon','tue','wed','thu','fri')<br>
12 - **duration:** last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). However, the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.<br>
# Aiii. other attributes:<br>
<a id="other_attributes"></a>
13 - **campaign:** number of contacts performed during this campaign and for this client (numeric, includes last contact)<br>
14 - **pdays:** number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)<br>
15 - **previous:** number of contacts performed before this campaign and for this client (numeric)<br>
16 - **poutcome:** outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')<br>

Output variable (desired target):<br>
17 - **y** (column `deposit` in the dataset) - has the client subscribed to a term deposit? (binary: 'yes','no')
***

## Dataset Summary

<h3> Exploring the Basics </h3>
<a id="overall_analysis"></a>
***
<ul>
<li type="square"> <b>Mean age</b> is approximately 41 years. (Minimum: 18 years, maximum: 95 years.)</li><br>
<li type="square"> The <b>mean balance</b> is 1,528. However, the standard deviation (std) is high, which tells us that balances are spread widely across the dataset.</li><br>
<li type="square"> As the attribute description notes, it is better to drop the duration column, since duration is highly correlated with whether a potential client will buy a term deposit. Also, <b>duration is obtained after the call is made to the potential client</b>, so it is not available for a target client who has not yet been called. Duration is highly correlated with opening a term deposit because the longer the bank talks to a target client, the higher the probability that the client opens a term deposit: a longer call signals higher interest (commitment) from the potential client.</li><br>
</ul>
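
Figures like these can be reproduced with the pandas `describe()` method. A small sketch, assuming the column names `age` and `balance` from the attribute list; `basic_summary` is a hypothetical helper name.

```python
import pandas as pd


def basic_summary(df, columns=("age", "balance")):
    """Return mean, std, min and max for the selected numeric columns."""
    stats = df[list(columns)].describe()
    # describe() also reports count and quartiles; keep only the four
    # statistics discussed in the summary.
    return stats.loc[["mean", "std", "min", "max"]]
```

Calling `basic_summary(df)` on the loaded dataset returns one row per statistic and one column per attribute.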

**Note: We cannot gain many insights from the descriptive statistics, since most of the descriptive information is located in the categorical columns rather than the numeric ones.**
***
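
Because most of the information sits in the categorical columns, they need to be encoded before a model can use them. The following is one possible sketch, not the required approach: it drops `duration`, drops any leftover `Unnamed: ...` index columns (as they appear in the dataset preview), maps the `deposit` target to 0/1, and one-hot encodes the remaining categorical columns with `pd.get_dummies`. The function name `preprocess` is an assumption.

```python
import pandas as pd

TARGET = "deposit"  # target column name as shown in the dataset preview


def preprocess(df):
    """Return (X, y): one-hot encoded features and a 0/1 target."""
    df = df.copy()
    # duration is only known after the call; "Unnamed" columns are
    # assumed to be leftover row indices without predictive meaning.
    drop_cols = [c for c in df.columns
                 if c == "duration" or c.startswith("Unnamed")]
    df = df.drop(columns=drop_cols)
    # Map 'yes'/'no' to 1/0 and remove the target from the features.
    y = (df.pop(TARGET) == "yes").astype(int)
    # One-hot encode every remaining categorical (string) column.
    X = pd.get_dummies(df, dtype=int)
    return X, y
```

Note that `pd.get_dummies` only creates columns for categories present in the frame it sees, so the evaluation data should be aligned to the training columns, for example with `DataFrame.reindex(columns=..., fill_value=0)`.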


In [1]:
import numpy as np
import pandas as pd

The following cell loads the dataset into the notebook as a pandas DataFrame and verifies it.

In [2]:
df = pd.read_csv('dataset.csv')
term_deposits = df.copy()
# Get a first look at the data.
print(df.shape)
if df.shape != (8930, 18):
    raise ValueError("Wrong Dataset shape. Make sure you load the correct data!")
df.head()


(8930, 18)


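| Unnamed: 0.1 | Unnamed: 0 | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | deposit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 59 | admin. | married | secondary | no | 2343 | yes | no | unknown | 5 | may | 1042 | 1 | -1 | 0 | unknown | yes |
| 1 | 2 | 41 | technician | married | secondary | no | 1270 | yes | no | unknown | 5 | may | 1389 | 1 | -1 | 0 | unknown | yes |
| 2 | 3 | 55 | services | married | secondary | no | 2476 | yes | no | unknown | 5 | may | 579 | 1 | -1 | 0 | unknown | yes |
| 3 | 4 | 54 | admin. | married | tertiary | no | 184 | no | no | unknown | 5 | may | 673 | 2 | -1 | 0 | unknown | yes |
| 4 | 5 | 42 | management | single | tertiary | no | 0 | yes | yes | unknown | 5 | may | 562 | 2 | -1 | 0 | unknown | yes |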


----
### Further Actions
- Add cells and functions as you like.
- You are not restricted to python notebooks, meaning you can develop your solution outside of this notebook.
- However, _this_ kernel should be the final solution and should contain any _final_ preprocessing steps and the learning phase. 

#### No cross-validation or optimization of any kind is allowed here. Use your final parameters.

## Evaluation Cell
This cell should show the final solution of your team. It currently just loads the data and returns a random number between 0 and 1. Your solution can return a value from 0 to 1 or from 0 to 100.

In [3]:
def evaluation():
    # Load the held-out evaluation data.
    df = pd.read_csv('evaluation.csv')
    term_deposits = df.copy()
    print(df.shape)
    if df.shape != (2232, 18):
        raise ValueError("Wrong Dataset shape. Make sure you load the correct data!")
    # Placeholder: apply your preprocessing to term_deposits and return your
    # model's accuracy here (the template hints at model.score....).
    return np.random.uniform()
evaluation()

(2232, 18)


0.06420903591660798
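
As an illustration of what a filled-in evaluation cell could look like, here is a sketch under several assumptions: a scikit-learn decision tree stands in for your model (the notes above ask for scipy or keras, so check that this library is acceptable), the target column is `deposit` as in the dataset preview, and the tree depth is an arbitrary fixed value rather than a tuned one. The helper `prepare` and the function parameters are hypothetical. The model is trained once with fixed parameters, with no cross-validation or optimization, and the function returns the accuracy on the evaluation data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

TARGET = "deposit"  # assumed target column, taken from the dataset preview


def prepare(df, columns=None):
    """Return (X, y): one-hot encoded features and a 0/1 target."""
    df = df.copy()
    # Drop duration (unknown before a call) and leftover index columns.
    drop_cols = [c for c in df.columns
                 if c == "duration" or c.startswith("Unnamed")]
    df = df.drop(columns=drop_cols)
    y = (df.pop(TARGET) == "yes").astype(int)
    X = pd.get_dummies(df, dtype=int)
    if columns is not None:
        # Align with the training columns: categories unseen in training
        # are dropped, categories missing here are filled with 0.
        X = X.reindex(columns=columns, fill_value=0)
    return X, y


def evaluation(train_path="dataset.csv", eval_path="evaluation.csv",
               expected_shape=None):
    train = pd.read_csv(train_path)
    test = pd.read_csv(eval_path)
    if expected_shape is not None and test.shape != expected_shape:
        raise ValueError("Wrong Dataset shape. Make sure you load the correct data!")
    X_train, y_train = prepare(train)
    X_test, y_test = prepare(test, columns=X_train.columns)
    # Fixed, final parameters: no cross-validation or search in this cell.
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X_train, y_train)
    # score() returns the mean accuracy, a value between 0 and 1.
    return model.score(X_test, y_test)
```

In the notebook this would be called as `evaluation(expected_shape=(2232, 18))` to keep the shape check of the template.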