# Financial Inclusion in Africa:


## 1. Objective:
The objective of this competition is to create a machine learning model to predict which individuals are most likely to have or use a bank account.

## 2. Evaluation metric:
The evaluation metric for this challenge will be the percentage of survey respondents for whom you predict the binary 'bank account' classification incorrectly.
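
The metric described above is the misclassification rate (1 minus accuracy), so lower is better. A minimal sketch of how it could be computed locally, assuming labels are encoded as 0/1; the function name `error_rate` is hypothetical and not part of the competition's tooling:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of respondents whose predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true != y_pred))

# Toy example: one of four predictions is wrong
print(error_rate([1, 0, 1, 0], [1, 0, 0, 0]))  # 0.25
```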

## 3. Submission:
Your submission file should look like:
```
unique_id                   bank_account
<string>                    <number>
uniqueid_1 x Kenya              1
uniqueid_2 x Kenya              0
uniqueid_3 x Kenya              1 
```
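
A sketch of how a file in this shape might be assembled. It assumes the test set has `uniqueid` and `country` columns and that the id is built as `<uniqueid> x <country>`, as the sample rows suggest; the helper `make_submission` and the output file name are hypothetical:

```python
import pandas as pd

def make_submission(test_df, predictions):
    """Build a submission frame with the two columns shown in the sample."""
    # Assumption: the id is "<uniqueid> x <country>", as in the sample rows
    unique_id = test_df["uniqueid"].astype(str) + " x " + test_df["country"].astype(str)
    return pd.DataFrame({
        "unique_id": unique_id.to_numpy(),
        "bank_account": pd.Series(predictions).astype(int).to_numpy(),
    })

# Toy rows standing in for the real test set
toy = pd.DataFrame({"uniqueid": ["uniqueid_1", "uniqueid_2"],
                    "country": ["Kenya", "Kenya"]})
submission = make_submission(toy, [1, 0])
print(submission)
# submission.to_csv("submission.csv", index=False)  # hypothetical output file name
```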

## 4. Data:

* Numerical:
    * household_size: Number of people living in one house
    * age_of_respondent: The age of the interviewee

* Categorical/string:
    * country: Country the interviewee is in.
    * year*: Year the survey was done in.
    * uniqueid: Identifier for each interviewee (IDs repeat across countries, as the value counts in the data exploration show).
    * location_type: Type of location: Rural, Urban
    * cellphone_access: Whether the interviewee has access to a cellphone: Yes, No
    * gender_of_respondent: Gender of the interviewee: Male, Female
    * relationship_with_head: The interviewee's relationship with the head of the house: Head of Household, Spouse, Child, Parent, Other relative, Other non-relatives, Dont know
    * marital_status: The marital status of the interviewee: Married/Living together, Divorced/Seperated, Widowed, Single/Never Married, Don't know
    * education_level: Highest level of education: No formal education, Primary education, Secondary education, Vocational/Specialised training, Tertiary education, Other/Dont know/RTA
    * job_type: Type of job the interviewee has: Farming and Fishing, Self employed, Formally employed Government, Formally employed Private, Informally employed, Remittance Dependent, Government Dependent, Other Income, No Income, Dont Know/Refuse to answer


## 5. Loading the data:

### 5.1. Setup:

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import os.path as path
%matplotlib inline

WORKING_DIR = "."
DATA_PATH = path.join(WORKING_DIR, "Data")

### 5.2. Loading data:

In [4]:
def load_data(data_path=DATA_PATH, file_name="Train_v2.csv"):
    """Read a CSV file from the data directory into a DataFrame."""
    csv_path = path.join(data_path, file_name)
    return pd.read_csv(csv_path)

In [5]:
fin_data = load_data()

## 6. Data exploration:

In [6]:
fin_data.head(15)

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed
5,Kenya,2018,uniqueid_6,No,Rural,No,7,26,Female,Spouse,Married/Living together,Primary education,Informally employed
6,Kenya,2018,uniqueid_7,No,Rural,Yes,7,32,Female,Spouse,Married/Living together,Primary education,Self employed
7,Kenya,2018,uniqueid_8,No,Rural,Yes,1,42,Female,Head of Household,Married/Living together,Tertiary education,Formally employed Government
8,Kenya,2018,uniqueid_9,Yes,Rural,Yes,3,54,Male,Head of Household,Married/Living together,Secondary education,Farming and Fishing
9,Kenya,2018,uniqueid_10,No,Urban,Yes,3,76,Female,Head of Household,Divorced/Seperated,No formal education,Remittance Dependent


In [7]:
fin_data.tail(15)

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
23509,Uganda,2018,uniqueid_2099,No,Rural,No,4,20,Female,Spouse,Married/Living together,Primary education,Other Income
23510,Uganda,2018,uniqueid_2100,No,Rural,Yes,4,30,Female,Spouse,Married/Living together,No formal education,Other Income
23511,Uganda,2018,uniqueid_2101,No,Rural,No,6,19,Female,Parent,Single/Never Married,Secondary education,No Income
23512,Uganda,2018,uniqueid_2102,No,Rural,No,2,57,Female,Head of Household,Divorced/Seperated,No formal education,Other Income
23513,Uganda,2018,uniqueid_2103,No,Urban,Yes,7,26,Female,Head of Household,Married/Living together,Secondary education,No Income
23514,Uganda,2018,uniqueid_2107,No,Urban,Yes,6,24,Female,Spouse,Married/Living together,Primary education,Self employed
23515,Uganda,2018,uniqueid_2108,No,Rural,No,6,16,Male,Parent,Single/Never Married,Primary education,Other Income
23516,Uganda,2018,uniqueid_2109,No,Urban,Yes,3,35,Male,Head of Household,Married/Living together,Primary education,Self employed
23517,Uganda,2018,uniqueid_2110,No,Urban,Yes,9,16,Male,Parent,Single/Never Married,Primary education,Other Income
23518,Uganda,2018,uniqueid_2111,No,Rural,Yes,9,20,Female,Child,Single/Never Married,Primary education,No Income


In [8]:
fin_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
country                   23524 non-null object
year                      23524 non-null int64
uniqueid                  23524 non-null object
bank_account              23524 non-null object
location_type             23524 non-null object
cellphone_access          23524 non-null object
household_size            23524 non-null int64
age_of_respondent         23524 non-null int64
gender_of_respondent      23524 non-null object
relationship_with_head    23524 non-null object
marital_status            23524 non-null object
education_level           23524 non-null object
job_type                  23524 non-null object
dtypes: int64(3), object(10)
memory usage: 2.3+ MB


In [9]:
fin_data.describe()

Unnamed: 0,year,household_size,age_of_respondent
count,23524.0,23524.0,23524.0
mean,2016.975939,3.797483,38.80522
std,0.847371,2.227613,16.520569
min,2016.0,1.0,16.0
25%,2016.0,2.0,26.0
50%,2017.0,3.0,35.0
75%,2018.0,5.0,49.0
max,2018.0,21.0,100.0


In [11]:
# Every non-integer column holds categorical/string data
cat_features = list(fin_data.select_dtypes(exclude=["int64"]).columns)
# "year" is stored as int64 but treated as categorical (see the note on year in the last section of the notebook)
cat_features.append("year")

In [12]:
cat_features

['country',
 'uniqueid',
 'bank_account',
 'location_type',
 'cellphone_access',
 'gender_of_respondent',
 'relationship_with_head',
 'marital_status',
 'education_level',
 'job_type',
 'year']
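
One way to make this categorical treatment explicit is to convert the columns to the pandas `category` dtype, so that `year` is no longer handled as a number. A sketch on a toy frame; the frame and its values are illustrative stand-ins, not rows taken from the dataset:

```python
import pandas as pd

# Toy frame standing in for fin_data
toy = pd.DataFrame({
    "country": ["Kenya", "Rwanda", "Uganda"],
    "year": [2018, 2016, 2018],
    "household_size": [3, 5, 4],
})

# Convert each categorical column, including the integer-coded year
for col in ["country", "year"]:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
print(toy["year"].cat.categories.tolist())  # [2016, 2018]
```

With the real data, the same loop could run over `cat_features`, leaving `household_size` and `age_of_respondent` numeric.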

In [23]:
for cat in cat_features:
    print(fin_data[cat].value_counts(), "\n ******")
    

Rwanda      8735
Tanzania    6620
Kenya       6068
Uganda      2101
Name: country, dtype: int64 
 ******
uniqueid_710     4
uniqueid_744     4
uniqueid_1417    4
uniqueid_1044    4
uniqueid_1533    4
                ..
uniqueid_6797    1
uniqueid_7218    1
uniqueid_7436    1
uniqueid_7501    1
uniqueid_8600    1
Name: uniqueid, Length: 8735, dtype: int64 
 ******
No     20212
Yes     3312
Name: bank_account, dtype: int64 
 ******
Rural    14343
Urban     9181
Name: location_type, dtype: int64 
 ******
Yes    17454
No      6070
Name: cellphone_access, dtype: int64 
 ******
Female    13877
Male       9647
Name: gender_of_respondent, dtype: int64 
 ******
Head of Household      12831
Spouse                  6520
Child                   2229
Parent                  1086
Other relative           668
Other non-relatives      190
Name: relationship_with_head, dtype: int64 
 ******
Married/Living together    10749
Single/Never Married        7983
Widowed                     2708
Divorced/Seper

In [17]:
fin_data["country"].value_counts()

Rwanda      8735
Tanzania    6620
Kenya       6068
Uganda      2101
Name: country, dtype: int64
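
The `bank_account` counts printed earlier (20212 No, 3312 Yes) show that the classes are imbalanced. As a rough reference point, the sketch below computes the error rate of a trivial model that always predicts the majority class; the counts are copied from the output above, and the variable names are illustrative:

```python
import pandas as pd

# Counts copied from the bank_account value_counts() output
counts = pd.Series({"No": 20212, "Yes": 3312}, name="bank_account")

shares = counts / counts.sum()             # class proportions
majority_label = counts.idxmax()           # most frequent class
baseline_error = float(1 - shares[majority_label])

print(majority_label, round(baseline_error, 4))  # No 0.1408
```

Any model worth submitting should score below this baseline on the competition metric.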

#### Observations:
* The data was loaded correctly: the dataframe has 23524 rows, the same number as the Train_v2.csv file.
* There is no missing data in this set: every column has 23524 non-null values.
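
Both observations can also be checked programmatically. A minimal sketch, assuming the CSV has a single header line and no line breaks inside quoted fields; the helper `check_load` is hypothetical:

```python
import io
import pandas as pd

def check_load(df, csv_file):
    """Return (row count matches the file, total number of missing values)."""
    # Data rows in the file = total lines minus the header line
    n_file_rows = sum(1 for _ in csv_file) - 1
    return len(df) == n_file_rows, int(df.isna().sum().sum())

# Toy CSV standing in for Train_v2.csv
csv_text = "country,year\nKenya,2018\nUganda,2018\n"
df = pd.read_csv(io.StringIO(csv_text))
rows_match, n_missing = check_load(df, io.StringIO(csv_text))
print(rows_match, n_missing)  # True 0
```

With the real file, the second argument would be an open handle on `path.join(DATA_PATH, "Train_v2.csv")` and the first would be `fin_data`.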

## Resources:

* On `year` (marked with * in the Data section): "It would make no sense if year were quantitative. What does it mean to average two years? For example, is there any meaning to (2015+2016)/2? What about multiplying two years? (1995×2016) Based on these examples, it would be more reasonable to classify year as a categorical variable." ([an answer on Quora](https://www.quora.com/Is-year-a-quantitative-or-categorical-variable))
* https://elitedatascience.com/feature-engineering