### Now for the main event... ANALYSING DATA!

We will use pandas in Python to analyse a dataset.
First, we need to import the necessary packages.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Next, we have to read in the data. If the data is stored in a CSV file, `pd.read_csv` loads it into a DataFrame.

In [None]:
housing_data = pd.read_csv("sample_data/california_housing_train.csv")
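
To see what `pd.read_csv` does without needing the file on disk, here is a minimal sketch that reads CSV text from memory. The column names mimic the housing file, but `csv_text`, `toy_data`, and all the values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: the values are invented, not taken from the real file.
csv_text = """longitude,latitude,total_rooms,median_house_value
-120.5,35.1,1000,150000
-121.0,36.2,2500,210000
-119.8,34.7,800,98000
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
toy_data = pd.read_csv(io.StringIO(csv_text))

# The header row becomes the column names and each column gets an inferred dtype.
print(toy_data.dtypes)
```

The first line of the file is treated as the header by default, and numeric columns are parsed as integers or floats automatically.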

What if we want to know the column names in the dataset?

In [None]:
housing_data.columns

We can also look at the first few rows of our dataset. By default `head()` returns five rows; pass a number to get more.

In [None]:
housing_data.head()

In [None]:
housing_data.head(10)

It's also possible to index into a specific row by its position.

In [None]:
housing_data.iloc[0]
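
As a small sketch of how positional indexing with `iloc` differs from label-based indexing with `loc`, consider the toy DataFrame below. The frame `toy` and its string index labels are hypothetical and are not part of the housing data.

```python
import pandas as pd

# Hypothetical toy frame with string labels so position and label are easy to tell apart.
toy = pd.DataFrame(
    {"total_rooms": [100, 200, 300, 400],
     "median_house_value": [1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

first_row = toy.iloc[0]        # first row by position, returned as a Series
middle_rows = toy.iloc[1:3]    # positions 1 and 2 (the end position is excluded)
by_label = toy.loc["c"]        # row looked up by its index label
single_value = toy.iloc[0, 1]  # row position 0, column position 1

print(first_row)
print(middle_rows)
```

In `housing_data` the default index is 0, 1, 2, ..., so position and label happen to coincide until rows are filtered or reordered.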

We might also want to know the size of our dataset. `shape` gives (rows, columns), while `len` gives the number of rows.

In [None]:
housing_data.shape

In [None]:
len(housing_data)

Another common concern is whether the dataset contains any missing values (NAs). The following returns one True/False flag per column.

In [None]:
housing_data.isna().any()
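
Beyond a yes/no answer per column, it often helps to count the missing values and decide what to do with them. The sketch below uses a made-up frame `toy` with a few NaNs; dropping rows or filling with the column median are just two common options, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing values.
toy = pd.DataFrame({
    "total_rooms": [100.0, np.nan, 300.0],
    "total_bedrooms": [10.0, 20.0, np.nan],
    "population": [5, 6, 7],
})

na_counts = toy.isna().sum()      # number of missing values per column
any_na = toy.isna().any().any()   # True if anything is missing anywhere
complete_rows = toy.dropna()      # keep only rows with no missing values
filled = toy.fillna(toy.median(numeric_only=True))  # fill with each column's median

print(na_counts)
```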

We might want to get a general sense of the summary statistics in our dataset, such as the count, mean, standard deviation, minimum, quartiles, and maximum of each numerical column.

In [None]:
housing_data.describe()
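
For numerical columns, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum; the mode is not part of that output. A minimal sketch on a made-up column (the frame `toy` is hypothetical) shows how to pull individual statistics out and how to get the mode separately:

```python
import pandas as pd

# Hypothetical toy column.
toy = pd.DataFrame({"total_rooms": [2, 2, 3, 5, 8]})

summary = toy.describe()

# describe() returns a DataFrame indexed by statistic name.
mean_rooms = summary.loc["mean", "total_rooms"]
median_rooms = summary.loc["50%", "total_rooms"]  # the 50% quantile is the median

# mode() returns a Series because several values can tie for most frequent.
mode_rooms = toy["total_rooms"].mode()

print(summary)
print(mode_rooms)
```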

We might also make histograms to visualize the distribution of some of our columns.

In [None]:
# Set the default figure size (width, height in inches),
# then draw one histogram per selected column
plt.rcParams['figure.figsize'] = 16, 12
housing_data[['total_rooms', 'total_bedrooms']].hist()
plt.show()

For more on matplotlib, visit: https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html

We can also draw a scatterplot to understand the relationship between two variables:

In [None]:
import seaborn as sns
sns.scatterplot(x='total_bedrooms', y='median_house_value', data=housing_data)
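
A scatterplot shows a relationship visually. As a possible numeric companion, the Pearson correlation coefficient summarises how linear that relationship is. This is a sketch on made-up data (the frame `toy` is hypothetical), not a result about the housing dataset.

```python
import pandas as pd

# Hypothetical toy data in which y is exactly twice x.
toy = pd.DataFrame({
    "total_bedrooms": [1, 2, 3, 4],
    "median_house_value": [2, 4, 6, 8],
})

# Series.corr computes the Pearson correlation by default (between -1 and 1).
r = toy["total_bedrooms"].corr(toy["median_house_value"])

# DataFrame.corr gives the full pairwise correlation matrix.
corr_matrix = toy.corr()

print(r)
print(corr_matrix)
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship, although a non-linear pattern may still be visible in the plot.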

Take a look at the data again. Is it categorical or numerical? Every column here is numerical. For categorical data, different summary statistics apply, such as the number of observations per category. Let's convert one of the columns into a categorical one by binning it into eight quantile-based bins with `pd.qcut`.

In [None]:
housing_data['median_house_value_discrete'] = pd.qcut(housing_data['median_house_value'], 8)
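
To make the binning concrete, here is a minimal sketch on a small made-up Series comparing `pd.qcut` (bins with roughly equal numbers of observations) to `pd.cut` (bins of equal width). The data and the label names are hypothetical.

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# qcut: bin edges are quantiles, so each bin holds about the same number of rows.
quantile_bins = pd.qcut(values, 4)

# Optional human-readable labels instead of interval notation.
labelled = pd.qcut(values, 4, labels=["low", "mid-low", "mid-high", "high"])

# cut: bin edges are evenly spaced across the range, so counts can be very uneven.
equal_width = pd.cut(values, 4)

print(quantile_bins.value_counts().sort_index())
print(equal_width.value_counts().sort_index())
```

With skewed data like house values, quantile-based bins avoid ending up with most observations in a single bin.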

In [None]:
housing_data.head()

Sometimes it is helpful to know how many observations we have in each category of a categorical variable.

In [None]:
housing_data[['median_house_value_discrete']].value_counts()

What if we want to convert the counts into a tidier DataFrame with a named count column?

In [None]:
housing_data[['median_house_value_discrete']].value_counts().reset_index(name="counts per home value category")

By default, `value_counts` orders the result by count, largest first. How about ordering by category instead?

In [None]:
housing_data[['median_house_value_discrete']].value_counts().reset_index(name="counts per home value category").sort_values(ascending=True, by='median_house_value_discrete')

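The same table can be built with `groupby` and `size`, which already returns the groups in category order.

In [None]:
# observed=True counts only the categories that actually occur in the data
housing_data.groupby(['median_house_value_discrete'], observed=True).size().reset_index(name="counts per home value category").sort_values(ascending=True, by='median_house_value_discrete')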
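
The difference between ordering by count and ordering by category is easier to see on a tiny example. The sketch below uses a made-up ordered categorical Series; the names `sizes` and `value_bin` and the labels are hypothetical.

```python
import pandas as pd

# Hypothetical ordered categorical data: low < middle < high.
sizes = pd.Series(
    pd.Categorical(
        ["high", "low", "high", "middle", "high", "low"],
        categories=["low", "middle", "high"],
        ordered=True,
    ),
    name="value_bin",
)

# Default: most frequent category first.
by_count = sizes.value_counts()

# sort_index reorders the result by category order instead of by count.
by_category = sizes.value_counts().sort_index()

# rename_axis names the index so reset_index yields predictable column names.
counts_table = by_category.rename_axis("value_bin").reset_index(name="count")

print(by_count)
print(counts_table)
```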