# Beginner Python and Math for Data Science
## Lecture 6 - Challenge
### Matplotlib

### Real Data Set
Let's switch gears and work with a real data set of homes sold in the Seattle area.

Let's start by reading and exploring the data.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# this statement allows the visuals to render within your Jupyter Notebook
%matplotlib inline 

In [None]:
data = pd.read_csv('data/SeattleHomePrices.csv')

In [None]:
data.head()

### Problem 1
Let's use a subset of the data set that contains only the following columns: PROPERTY TYPE, ZIP, PRICE, BEDS, BATHS, SQUARE FEET, DAYS ON MARKET. Save it in a pandas DataFrame named small_data.

In [None]:
### Write the code here

small_data = data[['PROPERTY TYPE', 
                   'ZIP', 
                   'PRICE',
                   'BEDS',
                   'BATHS', 
                   'SQUARE FEET', 
                   'DAYS ON MARKET']]


In [None]:
# small_data['BATHS'].isnull()

### Problem 2

* Show a sample of 5 houses
* Get a summary of the numerical values
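
As a rough sketch of the two pandas calls this problem points at, the toy frame below (made-up values, not the Seattle data) is sampled with `sample` and summarized with `describe`. The `random_state` argument is only there to make the sample repeatable.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'PRICE': [300000, 450000, 600000, 750000],
    'BEDS': [2, 3, 3, 4],
})

# Pick 3 random rows; random_state fixes the draw so reruns match
picked = toy.sample(3, random_state=0)

# Count, mean, std, min, quartiles and max for each numeric column
summary = toy.describe()

print(picked)
print(summary)
```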

In [None]:
### Write code here

#small_data.sample(5)
small_data.describe()


In [None]:
small_data.info()

### Problem 3
Our data set contains 350 houses.  However, there are some NULL values in SQUARE FEET and BATHS.  Let's drop the houses with at least 1 NULL value in any of the columns of the dataset.

**Hint:** Pandas dropna
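
A minimal sketch of what `dropna` does, using a toy frame with made-up values rather than the Seattle data. By default `dropna` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'BATHS': [1.0, np.nan, 2.5, 3.0],
    'SQUARE FEET': [900.0, 1200.0, np.nan, 2100.0],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isnull().sum()

# Keep only the rows that have no missing value in any column
toy_clean = toy.dropna()

print(missing_per_column)
print(toy_clean)
```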

In [None]:
### Write the code here

small_data = small_data.dropna()


In [None]:
small_data.info()

### Problem 4
Create a histogram that shows the distribution of the number of BEDS.

In [None]:
### Write the code here
plt.hist(small_data['BEDS'], bins = range(12))



Which is the most common number of bedrooms?
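
Besides reading the answer off the histogram, one possible way to check it numerically is `value_counts`. The sketch below uses a made-up Series of bedroom counts, not the real column.

```python
import pandas as pd

# Toy bedroom counts, made up for illustration only
beds = pd.Series([2, 3, 3, 4, 3, 2, 5])

# value_counts tallies how often each value occurs
counts = beds.value_counts()

# idxmax returns the value (index label) with the highest tally
most_common = counts.idxmax()

print(counts)
print(most_common)
```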

### Problem 5

Create a pie chart that shows the percentage of houses with more than 2 bathrooms, compared to houses with 2 or fewer bathrooms.

In [None]:
### Write the code here

# both slices must count BATHS so the two groups cover every house exactly once
plt.pie([sum(small_data.BATHS > 2), sum(small_data.BATHS <= 2)])



### Problem 6

Create two scatter plots that show PRICE on the $y$-axis and SQUARE FEET on the $x$-axis.

1) One scatter plot should be in a normal scale.

2) The other scatter plot should have a log-log scale.
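
One reason a log-log scale can be useful: a power-law relationship $y = a x^b$ turns into a straight line, because $\log y = \log a + b \log x$. The sketch below checks this on synthetic numbers. The values of `a` and `b` are arbitrary assumptions, not estimates from the housing data.

```python
import numpy as np

# Synthetic power law, parameters chosen arbitrarily for illustration
a, b = 200.0, 0.9
x = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
y = a * x ** b

# Fit a straight line to the logged values:
# log(y) = intercept + slope * log(x)
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)

print(slope)              # close to b
print(np.exp(intercept))  # close to a
```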

In [None]:
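### Write the code here
plt.figure(figsize = (10, 5))
plt.subplot(1, 2, 1)  # left panel: normal (linear) scale
plt.scatter(small_data['SQUARE FEET'], small_data['PRICE'])
plt.subplot(1, 2, 2)  # right panel: log-log scale
plt.scatter(small_data['SQUARE FEET'], small_data['PRICE'])
plt.xscale('log')
plt.yscale('log')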



### Problem 7 (Challenge)

Create a bar chart that shows the different PROPERTY TYPEs on the $x$-axis and the average days on the market on the $y$-axis.

In [None]:
### Write the code here

plt.figure(figsize = (15, 5))
# numeric_only skips any text columns instead of trying to average them
mean_data = small_data.groupby('PROPERTY TYPE').mean(numeric_only = True)
#mean_data

plt.bar(mean_data.index, mean_data['DAYS ON MARKET'])
plt.xticks(rotation = 90)


Which PROPERTY TYPE spends the most days on the market?
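
The question can also be answered without reading the chart. The sketch below is one possible approach on a toy frame (made-up listings, not the Seattle data): average the days per group, then ask for the group with the largest average.

```python
import pandas as pd

# Toy listings, made up for illustration only
toy = pd.DataFrame({
    'PROPERTY TYPE': ['Condo', 'Condo', 'Townhouse', 'Townhouse', 'Vacant Land'],
    'DAYS ON MARKET': [10, 20, 30, 50, 120],
})

# Average days on market for each property type
avg_days = toy.groupby('PROPERTY TYPE')['DAYS ON MARKET'].mean()

# idxmax returns the group label with the largest average
slowest_type = avg_days.idxmax()

print(avg_days)
print(slowest_type)
```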

In [None]:
from sklearn import cluster
model = cluster.KMeans(n_clusters = 3)
model.fit(small_data[['PRICE','SQUARE FEET']])

small_data['labels'] = model.labels_

plt.scatter(small_data['PRICE'], small_data['SQUARE FEET'], c = small_data['labels'])

In [None]:
plt.figure(figsize = (15,10))
# select PRICE before averaging so text columns are never passed to mean
mean_data_yb = data.groupby('YEAR BUILT')['PRICE'].mean()
plt.bar(mean_data_yb.index, mean_data_yb.values)
plt.xticks(rotation = 90)

In [None]:
titanic_data = pd.read_csv('data/train.csv')
titanic_data.head()

In [None]:
# FEEL FREE TO VISUALIZE THESE RESULTS
# 1) What % of passengers on the titanic are male / female?
# 2) What is the mean age of people on board?  What % of people were under 18?
# 3) Did male/female passengers have different survival rates?
# 4) Is there a correlation between the fare that they paid and their survival rate?
# 5) Anything else you want to investigate?

# round each fare down to the nearest 10 (assumes the fare column is named 'Fare')
titanic_data['fare_adjusted'] = [int(x/10)*10 for x in titanic_data['Fare']]
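
The rounding step above groups fares into bins of width 10. A vectorized way to do the same thing is floor division, sketched here on a few made-up fares (the variable names are illustrative only).

```python
import pandas as pd

# Toy fares, made up for illustration only
fares = pd.Series([7.25, 15.5, 71.28, 512.33])

# Floor-divide by 10, then scale back up: 71.28 -> 7.0 -> 70.0
fare_bins = (fares // 10) * 10

print(fare_bins.tolist())
```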

In [None]:
mean_age = np.mean(titanic_data['Age'])
titanic_data['Age'] = titanic_data['Age'].fillna(mean_age)
titanic_data.describe()

In [None]:
tfemale = sum(titanic_data['Sex']=='female')
tmale = sum(titanic_data['Sex']=='male')
plt.figure(figsize=(5,5))
plt.pie([tfemale, tmale])
#tfemale

In [None]:
# numeric_only skips text columns, which cannot be averaged
survive = titanic_data.groupby('Sex').mean(numeric_only = True)
print(survive)
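
Why a group mean answers the survival question: when an outcome is stored as 0 or 1, its mean is the fraction of 1s. The sketch below shows this on toy data. The column name `Survived` and all values are assumptions for illustration, not taken from the file loaded above.

```python
import pandas as pd

# Toy passengers, made up for illustration only
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'female',
            'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],  # hypothetical 0/1 outcome column
})

# The mean of a 0/1 column within each group is that group's survival rate
rates = toy.groupby('Sex')['Survived'].mean()

print(rates)
```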

### SOLUTIONS

### Problem 1
Let's use a subset of the data set that contains only the following columns: PROPERTY TYPE, ZIP, PRICE, BEDS, BATHS, SQUARE FEET, DAYS ON MARKET. Save it in a pandas DataFrame named small_data.

In [None]:
small_data = data[['PROPERTY TYPE', 'ZIP', 'PRICE', 'BEDS', 'BATHS', 'SQUARE FEET', 'DAYS ON MARKET']]

### Problem 2

* Show a sample of 5 houses
* Get a summary of the numerical values

In [None]:
small_data.sample(5)

In [None]:
small_data.describe()

### Problem 3
Our data set contains 350 houses.  However, there are some NULL values in SQUARE FEET and BATHS.  Let's drop the houses with at least 1 NULL value in any of the columns of the dataset.

**Hint:** Pandas dropna

In [None]:
# reassign instead of using inplace, since small_data was sliced from data
small_data = small_data.dropna()

### Problem 4
Create a histogram that shows the distribution of the number of BEDS.

In [None]:
plt.figure(figsize=[10,5])

# plt.hist(small_data['BEDS'])
# bins must be an integer, so cast in case BEDS is stored as float
plt.hist(small_data['BEDS'], bins = int(small_data['BEDS'].max()))
plt.title('Distribution of Number of Bedrooms');

### Problem 5

Create a pie chart that shows the percentage of houses with more than 2 bathrooms, compared to houses with 2 or fewer bathrooms.

In [None]:
greater = len(small_data[small_data['BATHS']>2])
lessthan = len(small_data[small_data['BATHS']<=2])

plt.figure(figsize=[5,5])
plt.pie([greater,lessthan],labels=['Greater','Less Than']);

### Problem 6

Create two scatter plots that show PRICE on the $y$-axis and SQUARE FEET on the $x$-axis.

1) One scatter plot should be in a normal scale.

2) The other scatter plot should have a log-log scale.

In [None]:
plt.figure(figsize=(10,5))

plt.subplot(1,2,1)
plt.scatter(small_data['SQUARE FEET'],small_data['PRICE'],alpha = 0.5,s=10);

plt.subplot(1,2,2)
plt.loglog(small_data['SQUARE FEET'],small_data['PRICE'],'r.',alpha = 0.5);

### Problem 7 (Challenge)

Create a bar chart that shows the different PROPERTY TYPEs on the $x$-axis and the average days on the market on the $y$-axis.

In [None]:
PropType = small_data.groupby(['PROPERTY TYPE'])['DAYS ON MARKET'].mean()
plt.bar(range(len(PropType)),PropType.values)
plt.xticks(range(len(PropType)),PropType.index,rotation = 90)
plt.grid();