# Reading in data files and inspecting the resulting dataframe

### We have some data that we would like to inspect, wrangle, and analyze.
### It all starts with getting the data into a pandas dataframe.

### There are many ways to get data into pandas. Most reader functions follow the same naming pattern:
- `pd.read_csv()`
- `pd.read_excel()`
- `pd.read_parquet()`
- `pd.read_sql()`
- etc.

### See this link for more options:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

### But first, let's start with importing pandas

In [1]:
import pandas as pd

### We can check which files are available in the current directory with the magic command `%ls`:

In [3]:
%ls

 Volume in drive C is Windows
 Volume Serial Number is F8D3-61BD

 Directory of C:\Users\sandervdo\Downloads\MIT\git\python_training_achmea

11/09/2021  09:38 AM    <DIR>          .
11/09/2021  09:38 AM    <DIR>          ..
11/07/2021  12:59 PM                18 .gitignore
11/09/2021  08:41 AM    <DIR>          .ipynb_checkpoints
11/07/2021  01:12 PM             5,297 00_Exercise_Python_and_Jupyter_notebook.ipynb
11/09/2021  08:39 AM             9,688 00_Jupyter_Notebooks.ipynb
11/09/2021  08:38 AM             8,916 00_Jupyter_Notebooks_Empty.ipynb
11/09/2021  09:28 AM             3,909 01_Reading_Data.ipynb
11/07/2021  01:14 PM             3,863 02_Exercise_Inspecting_Movies_Data.ipynb
11/07/2021  10:21 AM             7,031 02_Inspecting_Data.ipynb
11/07/2021  10:21 AM             6,678 02_Inspecting_Data_Empty.ipynb
11/07/2021  12:48 PM             5,145 03_Basic_manipulations.ipynb
11/05/2021  01:48 PM             4,771 03_Basic_manipulations_Empty.ipynb
11/05/2021  01:48 PM        

### Let's read in some Titanic data with `pd.read_csv()`. It's common practice to assign the result to a variable called `df`.

In [4]:
df = pd.read_csv('titanic.csv')

### What did we just create? We can use the built-in Python function `type()` to find out what type of object this is:

In [5]:
type(df)

pandas.core.frame.DataFrame

### Let's inspect the first rows of the result with `df.head()`. This method is available on both dataframes and series. Passing a number, as in `df.head(3)`, sets how many rows are shown (the default is 5).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [6]:
df.head(3)

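| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | | Southampton | yes | True |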


### Or check the last rows with `df.tail()`, which shows five rows by default:

In [7]:
df.tail()

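| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0 | S | Second | man | True | | Southampton | no | True |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female | | 1 | 2 | 23.45 | S | Third | woman | False | | Southampton | no | False |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0 | C | First | man | True | C | Cherbourg | yes | True |
| 890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.75 | Q | Third | man | True | | Queenstown | no | True |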


### What is the shape of this dataframe? We can use the attribute `.shape`, which holds a tuple of (number of rows, number of columns):

In [8]:
df.shape

(891, 15)
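Because `.shape` is an ordinary tuple, it can be unpacked into two variables. The sketch below uses a tiny made-up dataframe (`small_df` is a hypothetical stand-in, not the Titanic file) so that it runs on its own:

```python
import pandas as pd

# Hypothetical three-row dataframe, standing in for the Titanic data
small_df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
})

# .shape is an attribute (no parentheses) holding (n_rows, n_columns)
n_rows, n_cols = small_df.shape
print(f"{n_rows} rows and {n_cols} columns")

# len() on a dataframe gives the number of rows
print(len(small_df))
```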

### What are the names of all columns? We can see that with another attribute called `.columns`

In [9]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

### As you can see, pandas stores the column names in an object it calls an `Index`.

### And while we're at it, let's also check the row index of this dataframe with the attribute `.index`

In [10]:
df.index

RangeIndex(start=0, stop=891, step=1)

### DataFrames have many methods and attributes. You can explore them with tab completion: type `df.` and press `Tab`.

In [None]:
df.
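Tab completion only works in an interactive session. As a rough non-interactive alternative, Python's built-in `dir()` returns the same kind of list of names. This is only a sketch on a hypothetical one-column dataframe (`small_df`), not something the notebook itself runs:

```python
import pandas as pd

# Hypothetical dataframe, used only to have an object to inspect
small_df = pd.DataFrame({"age": [22.0, 38.0, 26.0]})

# dir() lists every attribute name; keep the public ones (no leading underscore)
public_names = [name for name in dir(small_df) if not name.startswith("_")]

# How many public methods and attributes are there?
print(len(public_names))

# Filter the list, similar to typing `df.he` and pressing Tab
print([name for name in public_names if name.startswith("he")])
```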

### Let's get a general overview of the dataframe with the method `df.info()`.
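A minimal sketch of what `.info()` reports, again on a small made-up dataframe (`small_df` and its values are assumptions for illustration). Note that `.info()` prints its summary and returns `None`; the `buf` argument is one way to capture the text instead:

```python
import io
import pandas as pd

# Hypothetical dataframe with one missing age
small_df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22.0, None, 26.0],
})

# Prints the index range, each column's non-null count and dtype, and memory usage
small_df.info()

# info() prints rather than returns, so pass a buffer to capture the text
buffer = io.StringIO()
small_df.info(buf=buffer)
info_text = buffer.getvalue()
```

The non-null counts are a quick way to spot columns with missing values.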

### Reading in data with `pd.read_csv()` was very easy (maybe too easy?). Let's check which arguments this function accepts by pressing `Shift + Tab` with the cursor inside the parentheses.
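To give an idea of what those arguments do, here is a small sketch using three of them: `sep`, `usecols`, and `nrows`. The data is made up and read from an in-memory string via `io.StringIO` so the example does not depend on any file; normally a file path would go in its place:

```python
import io
import pandas as pd

# Made-up semicolon-separated data; a file path would normally go here
raw_text = "name;age;fare\nAnna;22;7.25\nBob;38;71.28\nCara;26;7.93\n"

example_df = pd.read_csv(
    io.StringIO(raw_text),
    sep=";",                   # column separator (the default is a comma)
    usecols=["name", "fare"],  # only load these columns
    nrows=2,                   # only read the first two data rows
)
print(example_df)
```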

### Can we get a short numerical summary of the data? Yes, with `df.describe()`.
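A minimal sketch of `.describe()` on a hypothetical four-row dataframe (the values are invented for illustration). By default only numeric columns are summarised; `include="all"` adds the others:

```python
import pandas as pd

# Hypothetical dataframe with one text column and one numeric column
small_df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 30.0],
})

# Numeric columns only: count, mean, std, min, quartiles, max
summary = small_df.describe()
print(summary)

# include="all" also summarises non-numeric columns (count, unique, top, freq)
print(small_df.describe(include="all"))
```

The result of `.describe()` is itself a dataframe, so single values can be looked up with, for example, `summary.loc["mean", "age"]`.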

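### To quickly count how often each value occurs in a column, we can use `df['column_name'].value_counts(dropna=False)`. With `dropna=False`, missing values are counted as well. For the `sex` column this gives: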

male      577
female    314
Name: sex, dtype: int64
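The effect of `dropna=False` is easiest to see on a column that actually contains a missing value. The sketch below uses a hypothetical four-value series (not the real `sex` column, which has no missing values in the output above):

```python
import pandas as pd

# Hypothetical column with one missing value
sex = pd.Series(["male", "female", None, "male"], name="sex")

# dropna=False: the missing value gets its own row in the result
counts = sex.value_counts(dropna=False)
print(counts)

# Default behaviour: missing values are left out
counts_no_nan = sex.value_counts()
print(counts_no_nan)

# normalize=True returns fractions instead of counts
shares = sex.value_counts(normalize=True)
print(shares)
```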

## New concepts discussed here:
- general pandas functions, such as `pd.read_csv()`
- attributes of dataframes, such as `df.shape`, `df.columns`, and `df.index`
- methods of a dataframe, such as `df.head()`, `df.tail()`, `df.info()`, and `df.describe()`
- getting counts of values in a column: `df['column_name'].value_counts()`