[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/TobGerken/ISAT300/blob/main/1_PracticeChallenge.ipynb)

This notebook assumes that you have completed the lecture data-analysis notebook:

1. [Getting Started](https://colab.research.google.com/github/TobGerken/ISAT300/blob/main/1_GettingStarted.ipynb)

# First Practice Challenge

**This notebook is published on my GitHub. It is publicly accessible, but you cannot save your changes to my GitHub. Learning Git and GitHub is beyond the scope of this course. If you are familiar with GitHub, you know what to do. If you are not, save a personal copy of the file to your Google Drive so that you can save your changes and access them later.**

<img src="https://raw.githubusercontent.com/TobGerken/ISAT300/main/Figures/SaveFile.png" alt="drawing" width="800"/>


## Learning Goals 

This exercise is designed for you to practice some of the data analysis skills that we encountered in the first data analysis lesson. 
We will apply them to a commonly used dataset of house prices. 

After completing this exercise, you should be able to:

- use pandas with google colab to read data into a dataframe object
- understand how data is organized into rows and columns
- select one or multiple columns from the dataset
- apply descriptive statistics methods to data in columns
- select data based on a condition


## Now let's get started



Because pandas is not part of the core Python language, we have to import it as a module:

In [2]:
# running this will import pandas.
import pandas as pd

## Reading data into a pandas dataframe

The housing data is available both online (`'https://raw.githubusercontent.com/TobGerken/ISAT300/main/Data/Housing.csv'`) and on Canvas. 

See how we did it in lecture and load the data into a dataframe:


In [5]:
# This loads the data
df = pd.read_csv('https://raw.githubusercontent.com/TobGerken/ISAT300/main/Data/Housing.csv')
df

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished


## Exploring the data 

Let's find out what the data looks like. Display it and work out what each column means: 

We can also apply the `.info()` method to the dataframe to learn more about our dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB
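
A summary like the one above can be produced along the following lines. This is only a minimal sketch: it uses a small hand-typed stand-in called `sample` (a hypothetical name, with three rows and four columns copied from the table shown earlier) instead of the full file, so the counts it prints will differ from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the full dataframe: three rows and four columns
# copied by hand from the table shown earlier.
sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "area": [7420, 8960, 9960],
    "bedrooms": [4, 4, 3],
    "airconditioning": ["yes", "yes", "no"],
})

# .head(n) returns the first n rows as a new dataframe
first_rows = sample.head(2)
print(first_rows)

# .info() prints the column names, non-null counts, and dtypes
sample.info()
```

In the notebook itself, the same calls would be applied to `df` rather than to `sample`.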


You also want to find the shape (i.e., how many rows and columns you have): 
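
One possible way to do this is the `.shape` attribute, which holds a `(rows, columns)` tuple. The sketch below again uses a small hand-typed stand-in named `sample` (a hypothetical name, not the full dataset).

```python
import pandas as pd

# Hypothetical stand-in: three rows and three columns copied from the table above
sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "area": [7420, 8960, 9960],
    "bedrooms": [4, 4, 3],
})

# .shape is an attribute (no parentheses): a tuple of (number of rows, number of columns)
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

For the full dataframe, `.info()` reported 545 entries and 13 columns, so `df.shape` should agree with those numbers.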

## Selecting Data and Statistics

As a next step, let's find out a bit more about our dataset. Try selecting the column with the house price and find the mean price of the houses in the dataset. 

In [11]:
df['price'].mean()

4766729.247706422
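
Selecting several columns at once works in a similar way: pass a list of column names inside the square brackets. The sketch below assumes a small hand-typed stand-in named `sample` (hypothetical, three rows copied from the table above) and computes the mean of two columns in one call.

```python
import pandas as pd

# Hypothetical stand-in: three rows copied from the table above
sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "area": [7420, 8960, 9960],
    "bedrooms": [4, 4, 3],
})

# A list of names inside the brackets selects several columns (note the double brackets)
price_and_area = sample[["price", "area"]]

# .mean() on a dataframe returns one mean per column
means = price_and_area.mean()
print(means)
```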

We also want to find the standard deviation of the house prices: 
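
A minimal sketch of one way to do this is the `.std()` method, shown here on a small hand-typed stand-in named `sample` (hypothetical, three prices copied from the table above). Note that pandas computes the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical stand-in: three prices copied from the table above
sample = pd.DataFrame({"price": [13300000, 12250000, 12250000]})

# .std() works like .mean(): select the column, then call the method
price_std = sample["price"].std()
print(price_std)
```

In the notebook, the equivalent call on the full data would be `df['price'].std()`.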

We can also use the `.value_counts()` method to count, for example, how many of the houses have air conditioning. Give it a try by selecting the right column and applying the method. 

In [12]:
# .value_counts() counts how often each value occurs in the selected column
df['airconditioning'].value_counts()

airconditioning
no     373
yes    172
Name: count, dtype: int64
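
If shares are more useful than raw counts, `.value_counts()` accepts `normalize=True`. The sketch below assumes a small hand-typed stand-in named `sample` (hypothetical, the air-conditioning values of the first five rows shown earlier), so its numbers are not those of the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: air-conditioning values of the first five rows shown earlier
sample = pd.DataFrame({"airconditioning": ["yes", "yes", "no", "yes", "yes"]})

# Raw counts per category
counts = sample["airconditioning"].value_counts()

# normalize=True returns fractions that sum to 1 instead of counts
shares = sample["airconditioning"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the full data, the counts above (172 of 545 houses) correspond to a share of roughly 32% with air conditioning.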

## First Data Analysis 

Let's try to find out whether houses with air conditioning are more expensive than houses without it. 

To do so, we can select only houses with air conditioning and save them to a new dataframe. 
We can then do the same for houses without air conditioning. 

In [18]:
df_ac = df.loc[df['airconditioning'] == 'yes']  # houses with AC
df_no_ac = df.loc[df['airconditioning'] == 'no']  # houses without AC
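
It may help to see what happens inside the square brackets. The comparison produces a column of `True`/`False` values (a boolean mask), and `.loc` keeps only the rows where the mask is `True`. The sketch below illustrates this on a small hand-typed stand-in named `sample` (hypothetical, the first five rows shown earlier); the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in: first five rows shown earlier, two columns only
sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000, 12215000, 11410000],
    "airconditioning": ["yes", "yes", "no", "yes", "yes"],
})

# Step 1: the comparison gives one True/False value per row
has_ac = sample["airconditioning"] == "yes"
print(has_ac)

# Step 2: .loc keeps the rows where the mask is True; ~ inverts the mask
sample_ac = sample.loc[has_ac]
sample_no_ac = sample.loc[~has_ac]

# Sanity check: the two subsets together should cover every row
print(len(sample_ac) + len(sample_no_ac) == len(sample))
```

The same sanity check applies to the full data: the lengths of `df_ac` and `df_no_ac` should add up to the 545 rows of `df`.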

Let's first look at them again:

In [20]:
df_ac

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
5,10850000,7500,3,3,1,yes,no,yes,no,yes,2,yes,semi-furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
457,3115000,3000,3,1,1,no,no,no,no,yes,0,no,unfurnished
461,3080000,4960,2,1,1,yes,no,yes,no,yes,0,no,unfurnished
504,2653000,3185,2,1,1,yes,no,no,no,yes,0,no,unfurnished
505,2653000,4000,3,1,2,yes,no,no,no,yes,0,no,unfurnished


As a next step, let's calculate the mean and standard deviation of the price column for both groups: 

In [21]:
df_ac['price'].mean()

6013220.5813953485

In [22]:
df_ac['price'].std()

1998149.4749927232

In [23]:
df_no_ac['price'].mean()

4191939.678284182

In [24]:
df_no_ac['price'].std()

1493711.7609608262
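
As an alternative to four separate cells, the same statistics can be collected in one table with `groupby`. This is only a sketch of one possible approach, not the method used in lecture; it runs on a small hand-typed stand-in named `sample` (hypothetical, the first five and last four rows shown earlier), so its numbers differ from the results above.

```python
import pandas as pd

# Hypothetical stand-in: first five and last four rows of the table shown earlier
sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000, 12215000, 11410000,
              1820000, 1767150, 1750000, 1750000],
    "airconditioning": ["yes", "yes", "no", "yes", "yes",
                        "no", "no", "no", "no"],
})

# Group rows by the air-conditioning value, then summarize the price in each group
summary = sample.groupby("airconditioning")["price"].agg(["mean", "std", "count"])
print(summary)

# Difference between the two group means
mean_difference = summary.loc["yes", "mean"] - summary.loc["no", "mean"]
print(mean_difference)
```

Applied to `df`, this should reproduce the means and standard deviations computed above, side by side, which makes it easier to compare the difference in means with the spread within each group.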

Based on these results, do you think there is a difference? 

In [None]:
# Feel free to type your answer here

## Wrapping up: 

### Don't forget to save your changes before leaving colab

This was a brief practice for you to apply what we previously learned to a new dataset. Your next data analysis will be for the lab on Friday, and we will do more in class next week.

### Learning Goals

After completing this exercise, you should be able to:

- use pandas with google colab to read data into a dataframe object
- understand how data is organized into rows and columns 
- select one or multiple columns from the dataset
- apply descriptive statistics methods to data in columns
- select data based on a condition


