# LAB 2 :: Python

## Get familiar with Colab Notebook

February 22, 2024

Learning outcome:

1.   Get familiar with the Colab interactive Python notebook
2.   Try some small exercises on the pre-work of data analysis, such as data loading, exploratory data analysis, visualization, etc.



**Let's import libraries**

**1 Load the housing dataset using pandas**

In [None]:
import pandas as pd
import os
import tarfile

*1.1 Download housing.tgz from the LMS and save it in either a local directory or a Google Drive directory*

*Note: You can also upload it into the Colab content directory*

In [None]:
from google.colab import files
uploaded = files.upload()

Go to *Files -> contents* to see the uploaded files and sample datasets

*1.2 Mount Google Drive*

In [None]:
from google.colab import drive
drive.mount('/mntDrive')

In [None]:
# copy file into drive (optional-if needed)
!cp housing.tgz /mntDrive/MyDrive/ColabNotebooks

*1.3 Extract tar*

In [None]:
notebooks = "/mntDrive/MyDrive/ColabNotebooks"
tgz_path = os.path.join(notebooks,"housing.tgz")
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=os.path.join(notebooks,"Data"))
housing_tgz.close()
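
The extraction step can also be written with a `with` block, which closes the archive even if extraction fails. The sketch below is a minimal illustration under stated assumptions: it builds a tiny throwaway archive in a temporary directory so that it runs anywhere, and the file names `sample.csv` and `sample.tgz` are hypothetical stand-ins for the housing files.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny archive so the sketch is self-contained (hypothetical file names).
    csv_path = os.path.join(workdir, "sample.csv")
    with open(csv_path, "w") as f:
        f.write("a,b\n1,2\n")
    tgz_path = os.path.join(workdir, "sample.tgz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        archive.add(csv_path, arcname="sample.csv")

    # List the members first, then extract; "with" closes the archive for us.
    out_dir = os.path.join(workdir, "Data")
    with tarfile.open(tgz_path) as archive:
        member_names = archive.getnames()
        archive.extractall(path=out_dir)

    extracted = sorted(os.listdir(out_dir))

print(member_names, extracted)
```

Listing the members before extracting is a quick way to confirm what an archive contains.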

*1.4 Write a small function to load the data:*

In [None]:
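def load_housing_data(housing_path=None):
  # Fall back to the extracted Data folder when no path is given
  if housing_path is None:
    housing_path = os.path.join(notebooks, "Data")
  csv_path = os.path.join(housing_path, "housing.csv")
  return pd.read_csv(csv_path)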

**Do ::** Check the top five rows using the head() method. <br>
**Think ::** Pay attention to the attributes of a new dataset. <br>

In [None]:
housing = load_housing_data()
housing.head()

**Find ::** How many attributes? What are they?

Each row represents one district. <br> There are 10 attributes: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity. <br> Later we will treat *median_house_value* as the output.

1.5 Alternatively, you can use the info() method to get a quick description of the data.
What is each attribute's data type?

In [None]:
housing.info()

All attributes are numerical except the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since the data was loaded from a CSV file we know that it must be a text attribute.
<br>
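
As a small illustration of how pandas reports column types, the sketch below uses a two-row toy DataFrame (hypothetical values, not the real dataset). Depending on the pandas version, a text column may be reported as `object` or as a string dtype, so the example selects non-numeric columns rather than relying on the exact dtype name.

```python
import pandas as pd

# Toy stand-in for two housing columns (hypothetical values).
toy = pd.DataFrame({
    "median_income": [8.3, 7.2],
    "ocean_proximity": ["NEAR BAY", "INLAND"],
})

# dtypes gives one entry per column.
print(toy.dtypes)

# Columns that are not numeric are the candidates for text/categorical handling.
text_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```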


**1.6 Do: Find out what categories exist in `ocean_proximity`, and how many districts belong to each category, using the *value_counts()* method.**

In [None]:
housing["ocean_proximity"].value_counts()

1.7 Next, the describe() method shows a summary of the numerical attributes.

In [None]:
housing.describe()

**Query:** What do you observe?

The count for total_bedrooms is 20,433, not 20,640.
This is because null values are ignored, so 207 districts are missing this attribute.
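
The effect of null values on the count can be reproduced on a toy DataFrame (hypothetical values below). This is a minimal sketch: `describe()` counts only non-null entries, while `isnull().sum()` counts the missing ones.

```python
import numpy as np
import pandas as pd

# Toy data with two missing bedroom counts (hypothetical values).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

non_null = toy.describe().loc["count"]   # non-null entries per column
missing = toy.isnull().sum()             # null entries per column

print(non_null)
print(missing)
```

For each column, the two numbers add up to the number of rows.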

Anything else?

Let's plot a histogram for each numerical attribute to get a feel of the data we are dealing with. <br>
A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis).

Before we can plot anything, we need to specify which backend Matplotlib should use.
We will use Jupyter's magic command %matplotlib inline, which tells Jupyter to set up Matplotlib
so that it uses Jupyter's own backend. Plots are then rendered within the notebook itself.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
housing.hist(bins=50, figsize=(20,15))
plt.show()

Query: Do you observe any problems with these histograms?
1. These attributes have very different scales.
<br>
2. Many histograms are tail-heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will need to transform these attributes later on to have more bell-shaped distributions.
<br>
3. The median house value was capped. This may be a serious problem since it is your target attribute (your labels): your Machine Learning algorithms may learn that prices never go beyond that limit. Possible options: (a) collect proper labels for the districts whose labels were capped; (b) remove those districts from the training set and the test set.
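
Two of these observations can be checked numerically. The sketch below is only an illustration on synthetic data: the log-normal sample, the log transform, and the ceiling of 500,000 are assumptions chosen for the example, not steps prescribed by the lab.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a tail-heavy attribute
# (hypothetical data, not the housing dataset).
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=1000))

# Skewness near 0 means roughly symmetric; a log transform pulls in the right tail.
skew_before = raw.skew()
skew_after = np.log(raw).skew()
print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")

# Capping: clip at an assumed ceiling, then count the rows sitting exactly on it.
cap = 500_000.0
capped = raw.clip(upper=cap)
n_at_cap = int((capped == cap).sum())
print(f"rows at the cap: {n_at_cap} of {len(capped)}")
```

Many rows sharing exactly the maximum value is a typical sign that an attribute was capped.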

**2 Create a Test Set**

The idea of creating a test set is simple: pick some instances randomly, typically 20% of the dataset (the ratio may vary), and set them aside.

In [None]:
import numpy as np

In [None]:
from sklearn.model_selection import train_test_split

*2.1 Random sampling*

In [None]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

**Query:** Can you create five different random train/test splits? <br>
**Do:** Create a list containing five (train_set, test_set) pairs, one per split.
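
One possible approach, sketched on a toy DataFrame rather than the housing data, is to call `train_test_split` once per seed and collect the results. Using the seeds 0 to 4 is an arbitrary choice for this illustration; other solutions (for example scikit-learn's `ShuffleSplit`) are equally valid.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing DataFrame (hypothetical data).
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# One (train, test) pair per seed; different seeds give different shuffles.
splits = [
    train_test_split(toy, test_size=0.2, random_state=seed)
    for seed in range(5)
]

for train_part, test_part in splits:
    print(len(train_part), len(test_part))
```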

*2.2 Stratified sampling*

We're told that the median income is essential in predicting median housing prices. So we want to ensure that the test set is representative of the various categories of incomes in the whole dataset.

The following code uses the *pd.cut()* function to create an income category attribute with five
categories:
<br>
category 1: 0 to 1.5,
<br>
category 2: 1.5 to 3,
<br>
and so on.

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1,2,3,4,5])
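
To see how the bin edges are applied, here is a minimal sketch on a handful of made-up income values. By default `pd.cut()` includes the right edge of each bin, so a value of exactly 1.5 falls in category 1 and a value of exactly 3.0 in category 2.

```python
import numpy as np
import pandas as pd

# Made-up income values, including some that sit exactly on a bin edge.
incomes = pd.Series([0.5, 1.5, 2.0, 3.0, 4.4, 5.9, 6.0, 10.0])

cats = pd.cut(incomes,
              bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
              labels=[1, 2, 3, 4, 5])

print(pd.DataFrame({"income": incomes, "income_cat": cats}))
```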

In [None]:
housing["income_cat"].hist()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

In [None]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
  strat_train_set = housing.loc[train_index]
  strat_test_set = housing.loc[test_index]

Check the income category proportions in the test set

In [None]:
strat_test_set["income_cat"].value_counts()/len(strat_test_set)
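
To judge whether stratification helped, the test-set proportions can be compared with those of the full dataset. The sketch below is an illustration on synthetic categories, and it uses `train_test_split(..., stratify=...)` as a shorter stand-in for `StratifiedShuffleSplit`; the category probabilities are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic income categories with unequal frequencies (hypothetical data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10])
})

# Stratified versus purely random 20% test sets.
_, strat_test = train_test_split(toy, test_size=0.2, random_state=42,
                                 stratify=toy["income_cat"])
_, rand_test = train_test_split(toy, test_size=0.2, random_state=42)

def proportions(df):
    # Share of rows in each income category, ordered by category label.
    return df["income_cat"].value_counts(normalize=True).sort_index()

compare = pd.DataFrame({
    "overall": proportions(toy),
    "stratified": proportions(strat_test),
    "random": proportions(rand_test),
})
compare["strat_error"] = (compare["stratified"] - compare["overall"]).abs()
compare["rand_error"] = (compare["random"] - compare["overall"]).abs()
print(compare)
```

The stratified column should track the overall proportions closely, while the random column can drift further from them.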

Now we need to delete the `income_cat` attribute, so the data is back to its original state.

In [None]:
for set_ in (strat_train_set, strat_test_set):
  set_.drop("income_cat", axis=1, inplace=True)

# 3 Discover the dataset - visualization
Next we will explore only the training set and put the test set aside. Let's create a copy so
that the following procedures will not harm the training set.

In [None]:
housing = strat_train_set.copy()

Do: Let's first create a scatterplot of all districts to visualize the data.

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")

We can observe an overplotting issue, which makes it difficult to see individual data points in the visualization.

We can adjust the alpha option to make the visualization better reflect the high density of data
points.

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

Query: Can you visualise the latitude and longitude relationship using other kinds of plot? <br>
Go through https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.html
*   Identify and observe the changes
*   Comment on the intuitive plot (s)



# 4 Correlations between attributes
We can easily compute the standard correlation coefficient between every pair of attributes.
For example, let's check how much each attribute correlates with the median house value:

In [None]:
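# numeric_only=True leaves out the text column ocean_proximity, which pandas 2.x would otherwise reject
corr_matrix = housing.corr(numeric_only=True)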
corr_matrix["median_house_value"].sort_values(ascending=False)

How to interpret the results? <br>
The correlation coefficient ranges from -1 to 1. When it is close to 1, there is a strong positive correlation;  <br>
for example, the median house value tends to go up when the median income
goes up.
<br>
When the coefficient is close to -1, there is a strong negative correlation. <br>
You can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north). Finally, coefficients close to zero mean that there is no linear correlation.
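
The three cases can be illustrated on a tiny made-up DataFrame. This is a minimal sketch with hypothetical column names: the last column depends entirely on `x`, yet its correlation coefficient is zero because the relationship is not linear.

```python
import pandas as pd

# Five made-up points, symmetric around zero.
toy = pd.DataFrame({"x": [-2, -1, 0, 1, 2]})
toy["up"] = 3 * toy["x"] + 1     # perfectly increasing line
toy["down"] = -2 * toy["x"]      # perfectly decreasing line
toy["curve"] = toy["x"] ** 2     # fully determined by x, but not linearly

# Pearson correlation of every column with x.
corr_with_x = toy.corr()["x"]
print(corr_with_x.round(2))
```

A coefficient near zero therefore rules out a linear relationship only, not a dependence in general.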

Alternatively, we can check for correlations between attributes using the pandas *scatter_matrix()* function. <br>
Let's focus on a few promising attributes that seem most correlated with the median housing value.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
attributes = ["median_house_value", "median_income", "total_rooms","housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Which attributes seem to be more predictive of the median house value?
<br>


*First,* correlation between the median house value and median income is indeed very strong;
you can clearly see the upward trend and the points are not too dispersed. <br>
*Second*, the price cap that
we noticed earlier is clearly visible as a horizontal line at 500,000. But this plot reveals other less obvious straight lines: a horizontal line around 450,000, another around 350,000, perhaps one around 280,000, and a few more below that. You may want to try removing the corresponding
districts to prevent your algorithms from learning to reproduce these data quirks. <br>

Query: Can you visualise the correlation between variables? <br>
Do: Go through the documentation of the Python packages heatmapz and seaborn.

# **Discuss in groups and answer the following questions:**
1. Can you estimate the median house value from a set of variable inputs (except median house value)?
2. Is it a machine learning problem?
3. What sort of machine learning problem is it?
4. Which important variables contribute statistically to the house value?
5. Advanced: Can you form groups of districts based on their attribute/feature/variable (in this context) values?

*Note: Support your answer with any sort of logical reasoning.*

**Disclaimer:** The above code is modified from the textbook "Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow".