[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/TobGerken/ISAT300/blob/main/1_GettingStarted.ipynb)

# Getting Started with Data in Jupyter

**This notebook is published on my GitHub. It is publicly accessible, but you cannot save your changes to my GitHub. Learning git and GitHub is beyond the scope of this course. If you are familiar with GitHub, you know what to do. If you don't know GitHub, you can save a personal copy of the file to your Google Drive so that you can save your changes and access them at a later date.**

<img src="https://raw.githubusercontent.com/TobGerken/ISAT300/main/Figures/SaveFile.png " alt="drawing" width="800"/>


## Learning Goals 

We are starting with some initial data analysis. 
[Pandas](https://pandas.pydata.org/) is a powerful data analysis library built on top of [Python](https://www.python.org) to read, manipulate, and visualize data. Much like Excel, pandas organizes data in tables, which it calls dataframes. 

Pandas and Python have become a go-to combination in data science for prototyping and developing data analysis frameworks. 

After completing this exercise, you should be able to:

- use pandas with Google Colab to read data into a dataframe object
- understand how data is organized into rows and columns
- select one or multiple columns from the dataset
- apply descriptive statistics methods to data in columns
- select data based on a condition


## Now let's get started 



Because pandas is not part of the core Python language, we have to import it as a module:

In [None]:
# running this will import pandas.
import pandas as pd

## Reading data into a pandas dataframe

To look at some data, we have to get it into a pandas dataframe. Pandas has many different functions for reading data that is saved as a file. 
All the functions that pandas provides can be accessed by writing `pd`, followed by a `.` and the function name.

For example, the `pd.read_csv()` function can be used to load data for analysis. 

I saved a dataset containing MPG (miles per gallon) values for different cars to the cloud. This file is a CSV file, which stands for 'comma separated values'. We can load the content of this file into a dataframe object that we call `df`.

In [None]:
# This loads the data, which is saved online 
df = pd.read_csv('https://raw.githubusercontent.com/TobGerken/ISAT300/main/Data/mpg_cated.csv')
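
To make the idea of a CSV file more concrete, here is a minimal sketch that reads a few comma-separated lines from a string instead of a file. The values and the names `csv_text` and `toy_df` are made up for illustration and are not taken from the course dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line holds the column names,
# every following line is one row, and commas separate the values.
csv_text = """mpg,cylinders,origin
18.0,8,usa
26.0,4,europe
31.0,4,japan
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```

Each line of text becomes one row of the dataframe, and each comma starts a new column.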

When using [Google Colab](https://colab.research.google.com), you can either load data that is stored in the cloud or upload data into Colab. 
To upload a file, follow the steps shown below: 

<img src="https://raw.githubusercontent.com/TobGerken/ISAT300/main/Figures/UploadingAFile.PNG " alt="drawing" width="800"/>


Assuming the file is now uploaded to Colab, you can load it into the dataframe by specifying the local path. To do so, right-click on the file and select "Copy path" (see figure below).  

<img src="https://raw.githubusercontent.com/TobGerken/ISAT300/main/Figures/getFilePath.png" alt="drawing" width="800"/>


This will return the location of the file that we want to load. It should be `/content/mpg_cated.csv`, which we can then use to load the data. You can try this out below. ***This only works if you have uploaded the data to Colab first.*** 

In [None]:
# this will load the file if it was uploaded to Colab
df = pd.read_csv('/content/mpg_cated.csv')

Let's have a look at the data. To do so, we can simply type the name of the variable the dataframe is stored in: `df`

In [None]:
df

We can now see a preview of the data we just loaded for analysis. 

**Questions:**
- **How would you describe what you see?**
- **What can you say about the data format?**
- **What can you say about the data itself?**

In [None]:
# if you want to you can write it in here: 


## Exploring the Data 

Here are a few useful commands for exploring the data. In pandas, we can apply methods to our dataframe. They are also chained with a `.`.
There are a lot of them, and you will get to know many more throughout the semester. 

For now, let's just try a few: `.head()`, `.shape`, `.info()`. Note that some have `()` after them. These are methods, and the parentheses are necessary to run them. Others, like `.shape`, are attributes and are written without parentheses.

In [None]:
# .head() will display the first five rows of the dataframe
df.head()

In [None]:
# shape will give you the dimension of the data
df.shape
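
The difference between attributes and methods can be sketched on a small, made-up dataframe. The name `toy_df` and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Made-up data, not the course dataset.
toy_df = pd.DataFrame({"mpg": [18.0, 26.0, 31.0], "cylinders": [8, 4, 4]})

# .shape is an attribute: no parentheses, it holds (rows, columns).
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# .head() is a method: the parentheses run it, and an optional number
# sets how many rows are returned.
first_two = toy_df.head(2)
print(first_two)

# Without parentheses, Python hands back the method itself instead of running it.
print(callable(toy_df.head))
```

If a cell prints something like `<bound method ...>` instead of a table, the parentheses are most likely missing.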

The `.info()` method provides some additional information about the data. 

**Q: Can you guess what the difference is between `object` and `int64` in the `Dtype` column?**

In [None]:
df.info()

## Selecting Data 

Dataframes work like tables (think Excel). They have rows and columns. 

Some dataframes can be very big, with many rows and many columns, so sometimes we just want to select a small portion of the dataframe. 

Often, we are interested in a specific column of the dataframe, and we can select it by its _column name_. To do so, we put the column name in square brackets `[<column name>]`.

For example, we can select only the `origin` column in our dataframe like this (note the quotes `''`, which denote that the column name is a _string_ and not a variable):  

In [None]:
df['origin']

If we want to select more than one column, we can supply a list of column names, like this: 

In [None]:
# more than one column is selected like this
df[['mpg', 'horsepower']]
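
The single and double brackets return slightly different objects. The sketch below uses a small, made-up dataframe (`toy_df` and its values are hypothetical) to show that one column name gives back a pandas Series, while a list of names gives back a smaller dataframe.

```python
import pandas as pd

# Made-up data, not the course dataset.
toy_df = pd.DataFrame({
    "mpg": [18.0, 26.0, 31.0],
    "horsepower": [130, 90, 65],
    "origin": ["usa", "europe", "japan"],
})

one_column = toy_df["mpg"]                   # single name -> Series
two_columns = toy_df[["mpg", "horsepower"]]  # list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.shape)
```

The inner pair of brackets in `[["mpg", "horsepower"]]` is an ordinary Python list; the outer pair is the selection.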

In [None]:
# Try selecting three other columns from the dataframe. For example: model_year, car_company, and weight
# complete the code below: 
df[]

Getting the value counts for categorical and discrete variables is also useful.
_Recall your statistics class about `categorical`, `discrete`, and `continuous` data_ 

In [None]:
df[['origin','cylinders']].value_counts()

**Q: What is being displayed here?**

**Q: What happens if you try this for continuous variables?**

In [None]:
## Try it out for the mpg and displacement columns. Is this information useful? 



## Most basic descriptive statistics 

We can also generate some very basic descriptive statistics by using the pandas `.describe()` method.

In [None]:
df.describe()

**Q: What do you notice?**

Sometimes, we are only interested in a few statistics and we can calculate these directly. 

In [None]:
# Here is an example:
df['mpg'].mean()

In [None]:
# Try calculating the sum (hint: .sum()); median; minimum (.min()), maximum (.max()) in this cell:



Whenever we have data, we need to understand its variation (or uncertainty). Let's find the _standard deviation_ (`.std()`) of the `'mpg'`-column in the dataframe. 

In [None]:
# Try this out 
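
To see what the standard deviation measures, here is a small worked sketch on made-up numbers (the names `values` and `manual_std` are hypothetical). It assumes the default behavior of `.std()` in pandas, which is the sample standard deviation that divides by `n - 1`.

```python
import math
import pandas as pd

# Made-up numbers, not the course dataset.
values = pd.Series([2.0, 4.0, 4.0, 6.0])

# Step 1: the mean of the values.
mean_value = values.mean()

# Step 2: squared distance of each value from the mean.
squared_deviations = (values - mean_value) ** 2

# Step 3: divide the sum by (n - 1) and take the square root.
manual_std = math.sqrt(squared_deviations.sum() / (len(values) - 1))

print(manual_std)
print(values.std())  # should agree with the manual calculation
```

A larger standard deviation means the values are spread further from the mean.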

## Creating New Columns and Doing Math

Data analysis requires data manipulation and storing the results. 

For example, we might want to convert the weight of the car from pounds to kg like this. 


In [None]:
df['weight_kg']=df['weight']*0.454 # One pound (lb) is about 0.454 kg
df.head()
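
Arithmetic on a column is applied to every row at once. The following minimal sketch shows this on made-up weights; `toy_df` and `LB_TO_KG` are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up weights in pounds, not the course dataset.
toy_df = pd.DataFrame({"weight": [2000, 3000, 4000]})

LB_TO_KG = 0.454  # approximate conversion factor: 1 lb is about 0.454 kg

# The multiplication is carried out row by row, and the result is stored
# in a new column that is added to the dataframe.
toy_df["weight_kg"] = toy_df["weight"] * LB_TO_KG
print(toy_df)
```

The original `weight` column stays unchanged, so both units remain available for later steps.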

You can even use your calculated statistics to, for example, find the deviation from the mean. 

In [None]:
df['weight_kg_deviation']=df['weight_kg']-df['weight_kg'].mean()
print('The mean value is:', df['weight_kg'].mean())
df.head()
# It looks like the US cars that we see are heavier than average, but we cannot be sure from a few data points. 
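
A quick way to check a deviation column is that the deviations from the mean should add up to zero (up to rounding). Here is a small sketch with made-up weights; `toy_df` and `mean_weight` are hypothetical names.

```python
import pandas as pd

# Made-up weights in kg, not the course dataset.
toy_df = pd.DataFrame({"weight_kg": [900.0, 1400.0, 1900.0]})

# .mean() returns a single number ...
mean_weight = toy_df["weight_kg"].mean()

# ... which pandas subtracts from every row of the column.
toy_df["deviation"] = toy_df["weight_kg"] - mean_weight
print(toy_df)

# Positive and negative deviations cancel out.
print(toy_df["deviation"].sum())
```

Negative values mark cars that are lighter than average, positive values mark cars that are heavier.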

## A first attempt at data analysis 

Maybe we want to find out how US and European cars compare in terms of gas mileage. Let's find out. For this, we have to learn how to select only the European and US cars. Luckily, pandas can easily do this because it understands conditionals (you should remember these from your programming class). 

In [None]:
df['origin']== 'europe'

In [None]:
# We can now select all cars that are made in Europe and save these to a new dataframe. 
# The .loc indexer helps us subset the data based on a condition. 
df_european = df.loc[df['origin']== 'europe']
df_european.head()
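
The two steps above, building a condition and then passing it to `.loc`, also work for numeric comparisons. This is a minimal sketch on made-up data; `toy_df`, `mask`, and `efficient` are hypothetical names.

```python
import pandas as pd

# Made-up data, not the course dataset.
toy_df = pd.DataFrame({
    "car": ["a", "b", "c", "d"],
    "mpg": [18.0, 26.0, 31.0, 22.0],
})

# The comparison is evaluated for every row and gives one True/False per row.
mask = toy_df["mpg"] > 25
print(mask.tolist())

# .loc keeps only the rows where the condition is True.
efficient = toy_df.loc[mask]
print(efficient)

# True counts as 1, so the sum is the number of matching rows.
print(mask.sum())
```

Storing the condition in its own variable first makes it easy to inspect before it is used to subset the data.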

In [None]:
# Why don't you do the same thing for U.S. cars and then calculate the average gas mileage for European and U.S. cars?  



## Wrapping up: 

### Don't forget to save your changes before leaving Colab

This was a brief run-down of some of the basic functions of Google Colab and _pandas_ and how we can use them to begin analyzing data. Similar to Excel, we have created a table that we can use to perform calculations and statistical analysis. 

Next time, we will learn how to make plots using _pandas_. 

One advantage of pandas over Excel is that every analysis step is written as part of a computer program, which means that we can easily change our calculations or change the data, without having to redo the entire sheet. 
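
As a rough illustration of this advantage, the sketch below wraps a small analysis in a function and runs it on two different made-up datasets without rewriting any steps. The function name `summarize_mpg` and the data are hypothetical.

```python
import pandas as pd

def summarize_mpg(data):
    """Return the count, mean, and standard deviation of the 'mpg' column."""
    return data["mpg"].agg(["count", "mean", "std"])

# Two made-up datasets, not the course dataset.
first = pd.DataFrame({"mpg": [18.0, 26.0, 31.0]})
second = pd.DataFrame({"mpg": [20.0, 24.0]})

# The same analysis steps run unchanged on either dataset.
print(summarize_mpg(first))
print(summarize_mpg(second))
```

Changing the calculation only requires editing the function once, and every dataset that uses it picks up the change.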

### Learning Goals

After completing this exercise, you should be able to:

- use pandas with Google Colab to read data into a dataframe object
- understand how data is organized into rows and columns 
- select one or multiple columns from the dataset
- apply descriptive statistics methods to data in columns
- select data based on a condition

### Additional Practice

You can try the following practice challenge at home

[First Practice Challenge](https://github.com/TobGerken/ISAT300/blob/main/1_PracticeChallenge.ipynb)
