[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/TobGerken/ISAT300/blob/main/1_GettingStarted.ipynb)

# Getting Started with Data in Jupyter

**This notebook is published on my GitHub. It is publicly accessible, but you cannot save your changes there. Learning git and GitHub is beyond the scope of this course. If you are familiar with GitHub, you know what to do. If you are not, save a personal copy of the file to your Google Drive so that you can keep your changes and access them at a later date.**

<img src="https://raw.githubusercontent.com/TobGerken/ISAT300/main/Figures/SaveFile.png" alt="drawing" width="800"/>

## Now let's get started

We are starting with some initial data analysis. 
[Pandas](https://pandas.pydata.org/) is a powerful data analysis tool built on top of [Python](https://www.python.org) to read, manipulate, and visualize data. Much like Excel, pandas organizes data in tables, which it calls dataframes. 

Because pandas is not part of the core Python language, we have to import it as a module:

In [None]:
# running this will import pandas.
import pandas as pd

## Reading data into a pandas dataframe

To look at some data, we have to get it into a pandas dataframe. Pandas has a lot of different functions to read data that is saved as a file. 

I saved a dataset containing MPG values for different cars to the cloud. This file is a CSV file, which stands for 'comma separated values'. We can load the content of this file into a dataframe object that we call `df`.

In [None]:
# This loads the data, which is saved online 
df = pd.read_csv('https://raw.githubusercontent.com/TobGerken/ISAT300/main/Data/mpg_cated.csv')
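
If you are wondering what a CSV file actually looks like, here is a small sketch. The three rows below are made up for illustration and are not taken from the course dataset. The first line of the text holds the column names, and every following line is one row. `pd.read_csv` accepts a URL, a file path, or a file-like object, which is why we can wrap the text in `io.StringIO` here.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: the first line is the header, the rest are rows.
csv_text = """mpg,cylinders,origin
18.0,8,usa
26.0,4,europe
24.0,4,japan
"""

# StringIO makes the text behave like an open file, so read_csv can parse it.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

I call the result `toy` so that it does not overwrite the `df` we loaded above.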

When using [Google Colab](https://colab.research.google.com), you can either load data that is stored in the cloud or upload data into Colab. 
To upload a file, follow the steps shown below: 

<img src="https://raw.githubusercontent.com/TobGerken/ISAT300/main/Figures/UploadingAFile.PNG" alt="drawing" width="800"/>


Assuming the file is now uploaded to Colab, you can load it into the dataframe by specifying the local path:

In [None]:
# don't execute this cell, because we have not uploaded the data.
# We have already read the data from the web above. 
df = pd.read_csv('./mpg_cated.csv')

Let's have a look at the data

In [None]:
df

**Q: What do we learn from this look?**

In [None]:
# if you want to you can write it in here: 


Here are a few useful commands for exploring the basic data:

In [None]:
# This will display the first couple of rows
df.head()

In [None]:
# This will give you the dimension of the data
df.shape
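
A few more exploration commands are sketched below on a small made-up dataframe (the values are hypothetical, not the course data). The same calls should work on `df`.

```python
import pandas as pd

# Hypothetical toy data with the same kind of columns as the MPG dataset.
toy = pd.DataFrame({
    "mpg": [18.0, 26.0, 24.0, 15.0],
    "cylinders": [8, 4, 4, 8],
    "origin": ["usa", "europe", "japan", "usa"],
})

n_rows, n_cols = toy.shape    # shape is a tuple: (rows, columns)
print(n_rows, "rows and", n_cols, "columns")
print(toy.columns.tolist())   # column names as a plain list
print(toy.dtypes)             # data type of each column
print(toy.tail(2))            # the last two rows
```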

## Selecting Data 

Dataframes can be very big and can also have many columns, so sometimes we just want to select a small portion of the dataframe. 

For example, we can select only the `origin` column. 

In [None]:
df['origin']

In [None]:
# more than one column is selected like this
df[['mpg', 'horsepower']]
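
The single and double brackets above return different kinds of objects. A minimal sketch on made-up data: single brackets with one name give a pandas `Series` (one column), while double brackets, which pass a list of names, always give a `DataFrame`, even when the list has only one name.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
toy = pd.DataFrame({
    "mpg": [18.0, 26.0, 24.0],
    "origin": ["usa", "europe", "japan"],
})

one_col = toy["origin"]       # a single column name returns a Series
as_frame = toy[["origin"]]    # a list of names returns a DataFrame

print(type(one_col).__name__)   # Series
print(type(as_frame).__name__)  # DataFrame
```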

Getting the value counts for categorical and discrete variables is also useful.
_Recall your statistics class about `categorical`, `discrete`, and `continuous` data_ 

In [None]:
df[['origin','cylinders']].value_counts()

**Q: What is being displayed here?**

**Q: What happens if you try this for continuous variables?**

In [None]:
## Try it out for the mpg and displacement columns. Is this information useful? 



## Basic descriptive statistics

We can also generate some very basic descriptive statistics by using pandas' `.describe()` method:

In [None]:
df.describe()

**Q: What do you notice?**

Sometimes, we are only interested in a few statistics and we can calculate these directly. 

In [None]:
# Here is an example:
df['mpg'].mean()
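
If you want several statistics at once without the full `.describe()` table, one option is `.agg()`, which takes a list of statistic names. This is only a sketch on made-up numbers, and it uses different statistics than the exercise below so that you can still try those yourself.

```python
import pandas as pd

# Hypothetical mpg values, not the course dataset.
toy = pd.DataFrame({"mpg": [18.0, 26.0, 24.0, 16.0]})

# agg returns a Series indexed by the names of the requested statistics.
stats = toy["mpg"].agg(["mean", "std"])
print(stats)
print("mean only:", stats["mean"])
```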

In [None]:
# Try calculating the sum (hint: .sum()); median; minimum (.min()), maximum (.max()) in this cell:



## Creating New Columns and Doing Math

Data analysis requires manipulating data and storing the results. 

For example, we might want to convert the weight of the car from pounds to kilograms like this: 


In [None]:
df['weight_kg'] = df['weight'] * 0.454  # 1 lb is about 0.454 kg
df.head()
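
The multiplication above is applied to every row at once. The same works between two columns, where pandas combines them row by row. Here is a small sketch with made-up numbers; the column name `hp_per_1000lb` is my own invention for this example.

```python
import pandas as pd

# Hypothetical values, not the course dataset.
toy = pd.DataFrame({
    "horsepower": [130, 90],
    "weight": [3504, 2130],   # pounds
})

# Row-by-row division of two columns, scaled to horsepower per 1000 lbs.
toy["hp_per_1000lb"] = toy["horsepower"] / toy["weight"] * 1000
print(toy)
```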

You can even use your calculated statistics in further calculations, for example to find each car's deviation from the mean. 

In [None]:
df['weight_kg_deviation']=df['weight_kg']-df['weight_kg'].mean()
print('The mean value is:', df['weight_kg'].mean())
df.head()
# It looks like the US cars that we see are heavier than average, but we cannot be sure from a few data points. 

## A first attempt at data analysis 

Maybe we want to find out how US and European cars compare in terms of gas mileage. Let's find out. For this, we have to learn how to select only the European and US cars. Luckily, pandas can easily do this, because it understands conditionals (you should remember these from your programming class). 

In [None]:
df['origin']== 'europe'

In [None]:
# We can now select all cars that are made in Europe and save these to a new dataframe. 
# The .loc indexer helps us subset the data based on a condition. 
df_european = df.loc[df['origin']== 'europe']
df_european.head()
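
To see what the condition is doing, here is a sketch on made-up data that uses a different origin. The comparison produces one `True` or `False` per row (often called a mask), and `.loc` keeps only the rows where the mask is `True`. Two conditions can be combined with `&`, as long as each one is wrapped in parentheses.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
toy = pd.DataFrame({
    "origin": ["usa", "europe", "japan", "usa", "japan"],
    "cylinders": [8, 4, 4, 8, 6],
})

mask = toy["origin"] == "japan"   # one True/False value per row
print(mask.sum())                 # True counts as 1, so this counts matching rows

japanese = toy.loc[mask]          # keep only the rows where the mask is True

# Both conditions must hold; the parentheses around each one are required.
japanese_four_cyl = toy.loc[(toy["origin"] == "japan") & (toy["cylinders"] == 4)]
print(japanese_four_cyl)
```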

In [None]:
# Why don't you do the same thing for US cars and then calculate the average gas mileage for European and US cars?
# How do they compare? 
