# Open Data Durban Hackathon Challenge

If you are new to Machine Learning and want some hands-on experience with basics such as underfitting/overfitting and regression/classification, this is the challenge for you.

**If you have had ANY experience before, we encourage you to please participate in some of our other Hackathon challenges and learn how Machine Learning works in industry!**

*As with all the other challenges, we have many tutors who will be happy to help answer any questions you have or give you hints throughout the hackathon so make sure to pick their brains if you are confused about anything or would like some suggestions for additional resources!*

---


Okay, let's get started! **[Open Data Durban](https://opendata.durban/odd/home)** has generously provided us with the *South African Census Data from 2011,* which will be the dataset we will work with today. Check out this [cool resource](https://wazimap.co.za/) on Census data for some extra info!

Let's first read the data in. If you haven't worked with a mixture of numerical and text data before, one of the best tools for manipulating this kind of data in Python is *[pandas](https://pandas.pydata.org/)*.


In [0]:
# Let's import the pandas library so we can read in our data
import pandas as pd

In [0]:
# Now we use the pd.read_excel function from Pandas to load the xlsx
loaded_odd_data = pd.read_excel("./Census_Data_2011_SALs.xlsx")

As we will manipulate the dataset, let's create a version of the data that we can change without affecting the original.
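A quick aside on what "a version we can change" means in pandas. The sketch below uses a tiny made-up table (not the census data) to show that a plain assignment only gives the same DataFrame a second name, while `.copy()` produces an independent one. The variable names here are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the census table (values are made up for illustration)
original = pd.DataFrame({"Province": ["Western Cape", "Gauteng"], "Gini": [0.61, 0.57]})

alias = original               # a second name for the SAME DataFrame
independent = original.copy()  # a separate DataFrame with its own data

alias.loc[0, "Gini"] = 0.0        # this also changes `original`
independent.loc[1, "Gini"] = 1.0  # this leaves `original` untouched

print(original)
```

If you print `original` after running this, the first Gini value has changed (through `alias`) but the second has not.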


In [9]:
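# .copy() gives us an independent DataFrame; a plain assignment would only
# create a second name for the same data, so edits would change the original too
manipulated_odd_data = loaded_odd_data.copy()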

# Let's view the first 10 rows of the dataset
manipulated_odd_data[0:10]

| Unnamed: 0 | SAL_CODE | Province | MN_NAME | MP_NAME | SP_NAME | Gini | SAL area | Population Persons | Population Households | 0 - 14 | ... | Higher | Rented | Owned but not yet paid off | Occupied rent-free | Owned and fully paid off | Cellphone | Percent Institutional / Transient | Percent Foreign Citizen | Percent Province Foreign | Percent Mobile Year Moved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1600001 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.613065 | 47.426901 | 2379 | 597 | 754 | ... | 0.020706 | 0.188552 | 0.015152 | 0.355219 | 0.436027 | 0.542714 | 0.0 | 0.006321 | 0.001263 | 0.003797 |
| 1 | 1600003 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.57166 | 74.705943 | 1398 | 285 | 502 | ... | 0.0 | 0.049645 | 0.0 | 0.173759 | 0.677305 | 0.715789 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1600006 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.695913 | 20.658969 | 1272 | 380 | 310 | ... | 0.019692 | 0.257895 | 0.018421 | 0.544737 | 0.178947 | 0.614173 | 0.0 | 0.007075 | 0.0 | 0.0 |
| 3 | 1600016 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.767308 | 20.45199 | 1026 | 238 | 319 | ... | 0.013289 | 0.693277 | 0.037815 | 0.193277 | 0.058824 | 0.607595 | 0.0 | 0.002924 | 0.0 | 0.0 |
| 4 | 1600019 | Western Cape | Matzikama | Rietpoort | Rietpoort SP | 0.582996 | 3.510544 | 972 | 235 | 306 | ... | 0.017401 | 0.017094 | 0.0 | 0.183761 | 0.717949 | 0.594937 | 0.0 | 0.003086 | 0.006192 | 0.0 |
| 5 | 1600023 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.630553 | 27.06123 | 903 | 212 | 289 | ... | 0.016109 | 0.514151 | 0.009434 | 0.34434 | 0.113208 | 0.614286 | 0.0 | 0.0 | 0.003322 | 0.0 |
| 6 | 1600024 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.651391 | 94.649117 | 897 | 285 | 258 | ... | 0.03616 | 0.233216 | 0.014134 | 0.508834 | 0.169611 | 0.652632 | 0.0 | 0.006689 | 0.0 | 0.016667 |
| 7 | 1600034 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.628205 | 32.526386 | 771 | 290 | 158 | ... | 0.061453 | 0.348276 | 0.031034 | 0.472414 | 0.131034 | 0.53125 | 0.0 | 0.011673 | 0.0 | 0.007752 |
| 8 | 1600040 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.709744 | 611.608086 | 699 | 230 | 212 | ... | 0.037338 | 0.489083 | 0.048035 | 0.318777 | 0.069869 | 0.513158 | 0.0 | 0.004292 | 0.0 | 0.0 |
| 9 | 1600042 | Western Cape | Matzikama | Matzikama NU | Matzikama NU | 0.635365 | 50.427921 | 690 | 229 | 189 | ... | 0.094003 | 0.298246 | 0.035088 | 0.47807 | 0.153509 | 0.636364 | 0.0 | 0.008696 | 0.004348 | 0.0 |


As you can see, there are many different columns. Be sure to list all the columns to see what data you have to work with. 

**Note**: If you have any questions about what a column name might mean, be sure to ask a tutor, who will be happy to assist.
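Since the preview above hides the middle columns behind `...`, here is a minimal sketch of how you might list everything. It runs on a small made-up DataFrame called `toy` (an assumption for illustration); on the real data you would call the same methods on `manipulated_odd_data`.

```python
import pandas as pd

# Toy stand-in with a few of the column names seen in the preview above
toy = pd.DataFrame({
    "SAL_CODE": [1600001, 1600003],
    "Province": ["Western Cape", "Western Cape"],
    "Gini": [0.613065, 0.57166],
})

# Ask pandas not to hide middle columns behind "..." when displaying a table
pd.set_option("display.max_columns", None)

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # every column name as a plain Python list
print(toy.dtypes)            # which columns are numeric and which are text
```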

---

# Data Pre-processing

So what is the first thing you do when you get new data? Oh you read the title? So you guessed it then! Data pre-processing ^.^

Time for you to do some work! 

## Task 1: Spend some time looking through the data and cleaning it up as much as you can but also learning more about it. If you change the original dataset, be sure to explain why! Anything interesting at this stage? Tell us what!

Some hints on questions you could ask:

* Are there NaNs/missing values in your data?

* Are there any duplicates or is there redundant information?

* What is the distribution of the different columns?

* How many rows/columns do you have? How many seem useful?

* What should you do with categorical columns?


Use the *manipulated_odd_data* variable to store your pre-processed data once you are happy with it and explain all your decisions/changes!
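To make the hints above a little more concrete, here is a rough sketch of the kinds of checks you could run. It uses a tiny made-up table (`toy`, with invented values), and the cleaning choices shown (dropping duplicates, dropping rows with missing values, one-hot encoding) are just one option among several, not the "correct" answer for the census data.

```python
import pandas as pd

# Made-up data with one duplicated row and one missing value
toy = pd.DataFrame({
    "Province": ["Western Cape", "Gauteng", "Gauteng", "Limpopo"],
    "Gini": [0.61, 0.57, 0.57, None],
    "Population Persons": [2379, 1398, 1398, 972],
})

missing_per_column = toy.isna().sum()         # number of NaNs in each column
duplicate_rows = int(toy.duplicated().sum())  # rows identical to an earlier row
summary = toy.describe()                      # count, mean, std, quartiles of numeric columns

# One possible clean-up: remove duplicates, then rows with missing values
cleaned = toy.drop_duplicates().dropna()

# One-hot encode the categorical column into one indicator column per category
encoded = pd.get_dummies(cleaned, columns=["Province"])

print(missing_per_column)
print(duplicate_rows)
print(encoded.columns.tolist())
```

Whether dropping rows is a good idea depends on how much data you would lose, so it is worth checking `missing_per_column` before deciding.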

In [0]:
# Please do your pre-processing here

# Regression 

Now that you have cleaned up your data, let's dive right into the Machine Learning. If you have been paying attention in the fundamentals lectures, you have probably heard the phrase *linear regression* at some point. Linear regression is a basic entry point to Machine Learning. Essentially, it involves learning a linear mapping from a set of x co-ordinates to a set of y co-ordinates, i.e. learning *y = mx + c*.
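To see what "learning *y = mx + c*" can look like in practice, here is a small from-scratch sketch for a single input column, using the standard least-squares formulas for the slope and intercept. The function name `fit_line` and the data points are made up for illustration; a library such as *sklearn* does the same job for many columns at once.

```python
def fit_line(xs, ys):
    """Return the slope m and intercept c that minimise the squared error of y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance
    # The best-fit line always passes through the point (mean_x, mean_y)
    c = mean_y - m * mean_x
    return m, c


# Made-up points that lie exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, c = fit_line(xs, ys)
print(m, c)
```

Because the points sit exactly on a line, the fit recovers a slope of 2 and an intercept of 1. Real data is noisy, so the learned line will only approximate it.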


---

## Task 2: Use your cleaned dataset to try and predict the *Gini Index* by learning a linear mapping.
(Try not to focus too much on how *good* your solution is)

Hint: Check out this [example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html), which uses a cool library called *sklearn*, to help you get started!

In your solution, make sure the following things are included:

* If you do not use some of your data, explain why.

* Why did you choose your training/test set split? (Check out this [link](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7) if you need an explanation on these concepts. Note: A validation set is not necessary for this example!)

* Be sure to present a few metrics to indicate the quality of your solution such as the *mean-squared-error* and the *R2 score*. And tell us why they're good!

* [OPTIONAL]: If you feel comfortable enough, pick 2 columns that you think are important to this task and try and plot your learned linear mapping! Check out the previous example for help on this :) Hint: *matplotlib*.

*Keep your code clean and well commented! You will be evaluated on all aspects of your solution and not necessarily on your final scores.*
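If the workflow in the list above feels abstract, here is a minimal sketch of the split / fit / evaluate steps on synthetic data. The features, target, noise level and the 80/20 split are all assumptions made for illustration; they are not taken from the census data, and your own choices should come with your own reasoning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: two feature columns and a target that depends on them linearly, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.5 + rng.normal(0, 0.01, size=200)

# Hold out 20% of the rows so the model is scored on data it has never seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # average squared error, lower is better
r2 = r2_score(y_test, y_pred)             # 1.0 means a perfect fit

print(mse, r2)
```

Comparing the scores on the training set with those on the test set is one simple way to spot overfitting: a model that does much better on the training rows than on the held-out rows has probably memorised rather than learned.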

---
Once you have completed this, you should be familiar with the following:

* Linear Regression
* Using *sklearn*
* Data Explainability 
* One-hot encoding
* Underfitting/Overfitting
* Evaluation Metrics

In [0]:
# Please do task 2 here

# Classification

Now that you have learned how to do linear regression, it's time for the other main category in ML: classification! Hopefully you noticed that regression is used when trying to predict continuous data. Classification, however, is the approach you take when learning to predict categorical data. By making some tweaks to the linear regression algorithm, we get Logistic Regression, which is used for classification problems. Check out this [link](https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102) for a proper explanation!
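One way to picture the "tweak": in the simplest two-class case, logistic regression takes the same linear output *mx + c* and passes it through the sigmoid function, which squashes it into a probability between 0 and 1. The sketch below uses made-up values for `m` and `c` and a hypothetical helper `predict_class`. Predicting many classes (like the nine provinces) generalises this idea, which libraries handle for you.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, m, c, threshold=0.5):
    """Turn the linear output m*x + c into a class label via the sigmoid."""
    probability = sigmoid(m * x + c)
    return 1 if probability >= threshold else 0


# Made-up parameters, chosen so that the probability is exactly 0.5 at x = 2
m, c = 2.0, -4.0

for x in [0.0, 2.0, 4.0]:
    print(x, round(sigmoid(m * x + c), 3), predict_class(x, m, c))
```

Inputs well below x = 2 get a probability near 0, and inputs well above it get a probability near 1.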

## Task 3: Use your cleaned dataset to try and predict the *Province* by using Logistic Regression

(Again, try not to focus too much on how *good* your solution is)

Hint: Check out this [example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html) to help you get started!

In your solution, make sure the following things are included:

* If you do not use some of your data, again, explain why.

* Why did you choose your training/test set split?

* Be sure to present your evaluation metrics. Hint: accuracy is a good one!

* Try to explain what a decision boundary is.

* [OPTIONAL]: If you feel comfortable enough, pick 2 columns that you think are important to this task and try and plot your learned boundaries! Check out the previous example for help on this :) Hint: *matplotlib*.
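As a rough illustration of the steps listed above, here is a sketch that trains *sklearn*'s `LogisticRegression` on two synthetic clusters of points and reads off the learned decision boundary. The data, the two made-up classes and the 75/25 split are assumptions for illustration only; the real task has more classes and messier data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: two well-separated clusters, one per class
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
class_1 = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2))
X = np.vstack([class_0, class_1])
y = np.array([0] * 100 + [1] * 100)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

# With two features and two classes, the decision boundary is the straight line
# where w1 * x1 + w2 * x2 + b = 0: points on one side are predicted as class 0,
# points on the other side as class 1
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]

print(accuracy)
print(w1, w2, b)
```

To plot the boundary with *matplotlib*, you could rearrange the equation to `x2 = -(w1 * x1 + b) / w2` and draw that line over a scatter plot of the points.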


---

Once you have completed this, you should now be familiar with the following:

* Logistic Regression
* Decision Boundaries


In [0]:
# Please do task 3 here

# Well done! You have made it to the end (/^.^)/ ~(^.^)~ \\(^.^\\)

[BONUS]: Now, here is a bonus task - once we reach the end of the hackathon, we will give you access to some of the data we kept hidden. Test out your models for Task 2 and Task 3 on this *new* data and show us your scores!

## Once you are happy with your notebook, please submit it to hackathon@indabax.co.za and stand a chance to win a prize! :) 

(Completing optional and bonus tasks will increase your chances so be sure to attempt those)