<h1 align="center">Python for DATA SCIENCE</h1><Br/>
<img src="https://goo.gl/ZKX5FF" style="width:15%; float:centre"><Br/>
<h2 align="center">Dr Mazen Gabriel Alhrishy</h2>
<h5 align="center"><i>MAZEN.ALHRISHY@GMAIL.COM</i></h5><Br/>

<table width=25%>
    <tr>
        <td>
            <a href="https://goo.gl/BTtR3C"><img src="https://goo.gl/rMsKok"></a>
        </td>
        <td>
            <a href="https://goo.gl/XaRDbH"><img src="https://goo.gl/KyMZcj"></a>
        </td>
        <td>
            <a href="https://goo.gl/9uCqS6"><img src="https://goo.gl/a8gcDK"></a>
        </td>
        <td>
            <a href="https://goo.gl/bnt2EL"><img src="https://goo.gl/1rT18x"></a>
        </td>
        <td>
            <a href="https://goo.gl/VmfU3S"><img src="https://goo.gl/WFFkxn"></a>
        </td>
    </tr>
</table>

***
# 8- Pandas for Data Analysis

> ## [I- Introduction](#I)
> ## [II- Fundamental data structures](#II)
> ## [III- Selecting data](#III)
> ## [IV- Essential functionality](#IV)
> ## [V- Working with missing data](#V)
> ## [VI- Reading and writing data in text format](#VI)
> ## [VII- Visualization](#VII)

> ### [- Exercises](#exercises)
> ### [- Solutions](#solutions)

***

## I- Introduction <a id='I'></a>

> ## [1. History](#I-1)
> ## [2. Installation](#I-2)
> ## [3. Motivation](#I-3)

### 1- History <a id='I-1'></a>

* Statistician Wes McKinney started working on pandas in 2008 while at AQR Capital Management out of the need for a high performance, flexible tool to perform **quantitative analysis on financial data**. By the end of 2009 it had been open sourced. By 2010, McKinney left AQR to pursue a PhD in statistics, leaving him little time to work on improving Pandas

<img src="https://goo.gl/JM4fN6" style="width:30%; border-radius:50%; float:left; padding:20px 30px 20px 30px"/>

<br><br>
“I felt that Python as a language was facing an existential crisis... Python was either going to become relevant as a statistical computing language or it wasn’t, and I felt it had so much potential. I decided to drop out of graduate school to work on Pandas as much as possible...”

— Wes McKinney (Creator and Benevolent Dictator For Life)

— Author of __[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)__

<br><br><br><br><br>
* The library’s name derives from **pan**el **da**ta, a common term for multidimensional data sets used in statistics and econometrics


* Pandas is written primarily in pure Python, but it also makes heavy use of NumPy and other extension code to provide good performance even for large panels


* __[Pandas website](https://pandas.pydata.org/)__

### 2- Installation <a id='I-2'></a>

* Pandas requires a number of dependencies. The Anaconda Python distribution already provides pandas built-in

* If you've created a basic virtual environment, you can get pandas using conda:

In [None]:
! conda install pandas --yes

* To verify the package was installed

In [None]:
! conda list

* To import into a Python script

In [None]:
import pandas as pd

### 3- Motivation <a id='I-3'></a>

* Pandas enables you to carry out your **entire data analysis workflow** in Python without having to switch to a more domain specific language like R. This includes: 
    - Data munging and preparation<sup>1</sup>
    - Data analysis<sup>2</sup>
    - Data modeling<sup>3</sup>


<sup>1</sup>Something to keep in mind is that in pandas, data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly

<sup>2</sup>Although pandas builds on NumPy, the biggest difference is that pandas is designed for heterogeneous, tabular data, whereas NumPy is better suited to homogeneous numerical arrays

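<sup>3</sup>pandas itself does not implement significant modelling functionality; the regression tools it once shipped have been removed, and modelling is usually done with libraries such as statsmodels or scikit-learn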

***
## II- Fundamental data structures <a id="II"></a>

There are two primary data structures in pandas you need to know about:

> ### [1- Series](#II-1)
> ### [2- DataFrame](#II-2)

### 1- Series <a id='II-1'></a>

* A __[series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)__ is a **1d array-like** structure. Like a normal array, Series contains a **sequence of values**, however in addition, each value also has an associated label that is also called an **index**


* The basic method to create a Series object is:

> s = pd.Series(data)

Data can be:

**_1- List of values_**

In [None]:
s1 = pd.Series([4, 7, -5])
s1

* The first column represents the index while the second represents the values

* Here, an index was automatically created and assigned to each value. However, we can specify indexes with the **index** argument

In [None]:
s1 = pd.Series([4, 7, -5], index=['a', 'b', 'c'])
s1

* There are several types of indexes that we can use. For example, for fixed frequency dates, we can use:

In [None]:
dates = pd.date_range('20180618', periods=3)
s1 = pd.Series([4, 7, -5], index=dates)
s1

**_2- Python dictionary_**

In [None]:
s2 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
s2

* The index of the resulting Series is built from the dict's keys. Recent pandas versions keep the keys in insertion order (older versions sorted them). If we want a specific order we can provide the **index** argument

In [None]:
s2 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}, index=['Texas', 'Ohio', 'Oregon', 'Utah'])
s2
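
The **index** argument does more than reorder: labels are matched against the dict's keys. The sketch below (the label 'California' is a made-up example) suggests what happens when the index and the keys do not agree.

```python
import pandas as pd

populations = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# 'California' is not a key of the dict, so its value becomes NaN;
# 'Utah' is not listed in the index, so it is left out of the result
s = pd.Series(populations, index=['California', 'Ohio', 'Oregon', 'Texas'])
print(s)
print(s.isnull())  # True only for 'California'
```

Because NaN is a floating point value, the values are stored as floats rather than integers here.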

**_3- Numpy array_**

In [None]:
import numpy as np

s3 = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
s3

### 2- DataFrame <a id='II-2'></a>

A __[DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)__ is a **2d array-like** (tabular) structure. It contains **columns of values**<sup>+</sup>. Each value has an associated **row index** and **column index** (think of it as a spreadsheet)

<sup>+</sup> Think of a DataFrame as a dict of Series, all sharing the same index
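
To make the "dict of Series" picture concrete, here is a minimal sketch with two made-up Series (`height` and `weight` are illustrative names) whose indexes only partly overlap. The DataFrame aligns them on a shared row index.

```python
import pandas as pd

height = pd.Series([1.70, 1.82], index=['ann', 'bob'])
weight = pd.Series([80.0, 65.0, 72.0], index=['bob', 'ann', 'cat'])

# Each Series becomes a column; rows are matched by label, not by position
people = pd.DataFrame({'height': height, 'weight': weight})
print(people)
```

Note that 'ann' gets the weight 65.0 even though it is listed second in `weight`, and 'cat' gets NaN for the missing height.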

* The basic method to create a DataFrame object is:

> df = pd.DataFrame(data)

Like Series, DataFrame accepts many different kinds of input. The most common ones are:

**_1- A dict of equal-length lists or NumPy arrays:_**

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],  # values as a list
       'year': np.array([2000, 2001, 2002, 2001, 2002, 2003]),  # values as a NumPy array
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}  # values as a list

df1 = pd.DataFrame(data)
df1

* The resulting DataFrame has an automatically created row index, and the **columns follow the order of the dict keys** (older pandas versions sorted them alphabetically). As with Series, we can specify the row labels using **index**. We can also specify the order of columns by using the **columns** argument

In [None]:
df1 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five', 'six'], columns=['year', 'state', 'pop'])
df1

**_2- A dict of dicts_**

In [None]:
pop = {'Nevada':{2001: 2.4, 2002: 2.9},
       'Ohio':{2000: 1.5, 2001: 1.7, 2002: 3.6}}

df2 = pd.DataFrame(pop)
df2

* The **outer dict keys** are interpreted **as column indexes** and the **inner keys as row indexes**. Also, we can see that any missing values are filled with **NaN** (Not a Number)

> In both Series and DataFrame, you can get the values using the **values** attribute (returned as a NumPy array), and the row indexes using the **index** attribute (as an Index object). You can also get the column indexes of a DataFrame using the **columns** attribute (as an Index object)

In [None]:
print(s2.values)
print(s2.index)

In [None]:
print(df2.values)
print(df2.index)
print(df2.columns)

***
## III- Selecting data <a id='III'></a>

There are a large number of ways to select data. We will cover the most common ones

        Keep in mind that the result will always be:
        - A NumPy scalar if a single value is selected
        - A Series if multiple values from one row/column are selected (i.e. 1d array)
        - A DataFrame if multiple values from multiple rows/columns are selected (i.e. nd array)

        Remembering this will help you to know how to process the result!
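
The three cases can be checked directly. This is a small sketch on a DataFrame like the one used in this section; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

single = frame.loc['Utah', 'two']                     # one value -> NumPy scalar
one_column = frame['two']                             # one column -> Series
block = frame.loc[['Ohio', 'Utah'], ['one', 'two']]   # several rows and columns -> DataFrame

print(type(single).__name__, type(one_column).__name__, type(block).__name__)
```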

> ### [1- Columns selection](#III-1)
> ### [2- Rows selection](#III-2)
> ### [3- Combined selection](#III-3)
> ### [4- Boolean selection](#III-4)

* Let's consider the following DataFrame and Series

In [None]:
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                  columns=['one', 'two', 'three', 'four'])

s = pd.Series([1, 5, 9, 13], index=['Ohio', 'Colorado', 'Utah', 'New York'])

### 1- Columns selection <a id='III-1'></a>

> - A **single column** (i.e. 1d array) can be selected **as a Series** or **as a DataFrame**  
- **Multiple columns** (i.e. nd array) can only be selected **as a DataFrame**

**_1- For single column_**

* A **single column** can be selected **as a Series** either by dict-like notation or by attribute

In [None]:
df

In [None]:
col2 = df['two'] # dict notation which is equivalent to selecting by attribute df.two
col2

* The **column** can also be selected **as a DataFrame**

In [None]:
col2 = df[['two']]
col2

**_2- For Multiple columns_**

* **Multiple columns** can only be selected **as a DataFrame**

In [None]:
df

In [None]:
col23 = df[['two', 'three']]
col23

### 2- Rows selection <a id='III-2'></a>

> The method to select rows depends on the data structure:  
- A **Series** behaves in a similar way to NumPy for indexing  
- A **DataFrame** does not behave like NumPy. For that we need to use the special indexing operators **iloc** and **loc**

**_1- For Series_**

* In Series, rows can be indexed in a similar way to NumPy using integers (end-point is exclusive)

In [None]:
s

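To get the second row, we can use its **integer position** (i.e. integer-indexing)

In [None]:
row2 = s.iloc[1]  # s[1] also works in older pandas, but positional lookup with [] on a labelled Series is deprecated
row2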

To get first 3 rows, we can slice by **index** as well (i.e. integer-slicing)

In [None]:
row13 = s[:3]
row13

* Another option to select rows in Series is by **label** (end-point is inclusive)

To get the second row, we can use its **label** (i.e. label-indexing)

In [None]:
row2 = s['Colorado']
row2

To get first 3 rows, we can slice by **label** as well (i.e. label-slicing)

In [None]:
row13 = s['Ohio':'Utah']
row13
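
The difference between the two end-point rules is easy to miss, so here is a short sketch on the same Series that compares them side by side. It uses the explicit **iloc** and **loc** operators, which also work on a Series.

```python
import pandas as pd

s = pd.Series([1, 5, 9, 13], index=['Ohio', 'Colorado', 'Utah', 'New York'])

by_position = s.iloc[0:2]         # stops before position 2, so 2 rows
by_label = s.loc['Ohio':'Utah']   # includes the 'Utah' end-point, so 3 rows

print(len(by_position), len(by_label))
```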

**_2- For DataFrame_**

* The special indexing operators **iloc** and **loc** enable selecting rows from a DataFrame with NumPy like notation using either: 
    - iloc integer-indexing
    - loc label-indexing

In [None]:
df

**_1- iloc integer-indexing_**

To get the second row we can use its **integer** position with **iloc**

In [None]:
row2 = df.iloc[1]
row2

To get the first 3 rows, we can slice by **integer** position with **iloc**

In [None]:
row13 = df.iloc[:3]  # slicing is also possible with normal integer-slicing (df[:3])
row13

**_2- loc label-indexing_**

To get the second row, we can use its **label with loc**

In [None]:
row2 = df.loc['Colorado']
row2

To get first 3 rows, we can slice by **label with loc**

In [None]:
row13 = df.loc['Ohio':'Utah']  # slicing is also possible with normal label-slicing (df['Ohio':'Utah'])
row13
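
The distinction between **loc** and **iloc** matters most when the row labels are themselves integers. A minimal sketch, using a made-up DataFrame whose labels are deliberately out of order:

```python
import pandas as pd

frame = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[2, 0, 1])

by_label = frame.loc[0]       # the row whose label is 0
by_position = frame.iloc[0]   # the row at position 0 (its label is 2)

print(by_label['value'], by_position['value'])
```

Here the same key, 0, picks two different rows depending on the operator.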

### 3- Combined selection <a id='III-3'></a>

> Now that we know the basics of column/row selections, we can combine these in different ways to get the desired values. Remember, the result will always be:
        - A NumPy scalar if a single value is selected
        - A Series if multiple values from one row/column are selected (i.e. 1d array)
        - A DataFrame if multiple values from multiple rows/columns are selected (i.e. nd array)

**_1- Selecting a single value_**

* Second column, second row

In [None]:
df

In [None]:
print(df['two'].iloc[1])  # get column 'two' as Series > get second element by position
print('---')
print(df.iloc[1, 1])  # get second row as Series > get second element (df.iloc[1].iloc[1])
print('---')
print(df.loc['Colorado', 'two']) # get 'Colorado' row as Series > get 'two' element (df.loc['Colorado'].loc['two'])

**_2- Selecting multiple values from one row/column_**

*  Second column, first 3 rows 

In [None]:
df

In [None]:
print(df['two'][:3])  # df['two'] then [:3]
print('---')
print(df.iloc[:3, 1])  # df.iloc[:3].T then iloc[1]
print('---')
print(df.loc[:'Utah', 'two'])  # df.loc[:'Utah'] then ['two']

**_3- Selecting multiple values from multiple rows/columns_**

*  Second to third column, first 3 rows

In [None]:
df

In [None]:
print(df[['two', 'three']][:3])  # df[['two', 'three']] then [:3]
print('---')
print(df.iloc[:3, 1:3])  # df.iloc[:3].T then iloc[1:3].T
print('---')
print(df.loc[:'Utah', 'two':'three'])  # df.loc[:'Utah'] then [['two','three']]

### 4- Boolean selection <a id='III-4'></a>

> All previous selections can be used with Boolean conditions to create a Truth mask (True, False). Depending on the selection made, masks can be either:
    - Series masks
    - DataFrame masks

**_1-Using a Series mask_**

In [None]:
df

* Select rows where second column > 5 

In [None]:
mask = df['two'] > 5
print(mask)
print('---')
print(df[mask])  # the mask's index aligns with the DataFrame's row index, so it selects rows

The **isin()** method can be used to specify a list of values for the mask  
Select rows where second column is 5 or 9 

In [None]:
mask = df['two'].isin([5, 9])
print(mask)
print('---')
print(df[mask])

* Select columns where second row > 5

In [None]:
df

In [None]:
mask = df.iloc[1] > 5  # equivalent to df.loc['Colorado'] > 5
print(mask)
print('---')
print(df.loc[:, mask]) # the mask's index aligns with the DataFrame's columns, so it selects columns

**_2-Using a DataFrame mask_**

In [None]:
df

* Select all values which are > 5

values where mask is False are replaced with NaN

In [None]:
mask = df > 5
print(mask)
print('---')
print(df[mask])

* Select values in the second and third columns which are > 5 

In [None]:
mask = df[['two', 'three']] > 5 
print(mask)
print('---')
print(df[mask])

* Select values in the first 3 rows, second to third column which are > 5

In [None]:
mask = df.iloc[:3, 1:3] > 5  # equivalent to df.loc[:'Utah', 'two':'three']
print(mask)
print('---')
print(df[mask])
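
Series masks can also be combined. The sketch below is an addition to the cases above and assumes the usual element-wise operators `&` (and), `|` (or) and `~` (not).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

# Parentheses are needed because & and | bind tighter than comparisons
both = frame[(frame['two'] > 1) & (frame['four'] < 15)]      # rows meeting both conditions
either = frame[(frame['one'] == 0) | (frame['one'] == 12)]   # rows meeting at least one
inverted = frame[~(frame['two'] > 5)]                        # rows where the condition is False

print(both.index.tolist(), either.index.tolist(), inverted.index.tolist())
```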

***
## IV- Essential functionality <a id='IV'></a>

> ### [1- Reindexing](#IV-1)
> ### [2- Sorting](#IV-2)
> ### [3- Basic arithmetic](#IV-3)
> ### [4- NumPy universal functions (ufunc)](#IV-4)
> ### [5- Computing descriptive statistics](#IV-5)
> ### [6- Function application and mapping](#IV-6)

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape(3, 3), index=['d', 'a', 'b'], 
                  columns=['A', 'B', 'C'])

df2 = pd.DataFrame(np.arange(9.,18.).reshape((3, 3)), index=['b', 'd', 'c'],
                   columns=['C', 'A', 'E'])

s1 = pd.Series([0.0, 1.0, 2.0], index=['D', 'A', 'E'])

s2 = pd.Series([6.0, 1.0, 4, 3.0], index=['D', 'A', 'B', 'C'])

### 1- Reindexing <a id="IV-1"></a>

> A Pandas object can be reindexed with the **reindex()** method. A new object is returned with the data rearranged according to the new index. Missing values are replaced with NaN

In [None]:
df1

In [None]:
df = df1.reindex(index=['a', 'b', 'c', 'd'])
df

* The columns can also be reindexed with the **columns** argument

In [None]:
df2

In [None]:
df = df2.reindex(index=['a', 'b', 'c', 'd'], columns=['A', 'B', 'C', 'D', 'E'])
df

### 2- Sorting <a id="IV-2"></a>

> Pandas objects can be sorted either:
    - By index using the sort_index()
    - By value using the sort_values()  
    
Both methods return a new object sorted in an ascending order by default

In [None]:
s2

* Sort **Series** by index

In [None]:
s2.sort_index()

* Sort **Series** by value in a descending order 

In [None]:
s2.sort_values(ascending=False)

* To sort a **DataFrame** by value, we need to specify a column to use its data for sorting. This is done using the **by** argument

In [None]:
df2

In [None]:
df2.sort_values(by='A', ascending=False)

* Multiple columns can also be used for sorting by values in DataFrames. Data is sorted according to the first column's values; in case of a tie, the second column's values are used, and so on

In [None]:
df2.loc['c', 'A'] = 13.0  # introduce a tie
df2

In [None]:
df2.sort_values(by=['A', 'C'])  # sort by 'A' values first. In case of duplicates in 'A', sort by 'C' values 

* For **DataFrames**, we can also sort columns by setting the **axis** argument value to **1** (i.e. along columns)

In [None]:
df2

In [None]:
df2.sort_index(axis=1, ascending=False)

### 3- Basic arithmetic <a id="IV-3"></a>

> When performing basic arithmetic between 2 Pandas objects, index pairs will always be aligned. If index pairs are not the same, the result will be their union (similar to outer join in database handling) replacing missing values with NaN. Arithmetic can be:
    - Between 2 Series
    - Between 2 DataFrames
    - Between a Series and a DataFrame
    
* Basic arithmetic operators are: $+$, $-$, $*$, $/$, $//$, $\%$, $**$

**_1-Between 2 Series_**: alignment is performed along rows

In [None]:
print(s1)
print('---')
print(s2)

In [None]:
s1 - s2

**_2-Between 2 DataFrames_**: alignment is performed along both rows and columns

In [None]:
print(df1)
print('---')
print(df2)

In [None]:
df1 + df2

**_3-Between a Series and a DataFrame_**: the Series index is aligned with the DataFrame columns, and arithmetic is performed row-wise (similar to NumPy behaviour known as broadcasting)

In [None]:
print(df1)
print('---')
print(s2)

In [None]:
df1 / s2
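
Sometimes the Series should be matched against the row labels instead of the columns. One way to sketch this is with the arithmetic methods (here **sub()**) and their **axis** argument; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9.).reshape(3, 3), index=['d', 'a', 'b'],
                     columns=['A', 'B', 'C'])
first_col = frame['A']   # a Series indexed by the row labels d, a, b

# The plain operator aligns first_col's index with the column labels,
# and none of them match, so every value is NaN
unaligned = frame - first_col

# axis='index' aligns the Series with the row labels instead
centered = frame.sub(first_col, axis='index')
print(centered)
```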

### 4- NumPy universal functions (ufunc) <a id="IV-4"></a>

> Because Pandas uses NumPy to enhance performance, all NumPy ufunc we talked about before (a function that operates on nd-arrays **element-wise**) can be used on Pandas objects with built-in handling for missing data. The result will be another Pandas object with the **indexes preserved**

* In fact, the basic arithmetic operators above map onto pandas methods that run vectorized NumPy operations under the hood. For example, adding the 2 DataFrames above with $+$ is equivalent to calling the **add()** method

In [None]:
df1.add(df2)
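
One practical reason to prefer the method form is its **fill_value** argument, which stands in for a value that is missing on one side only. A minimal sketch, rebuilding the two DataFrames under the illustrative names `left` and `right`:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(9.).reshape(3, 3), index=['d', 'a', 'b'],
                    columns=['A', 'B', 'C'])
right = pd.DataFrame(np.arange(9., 18.).reshape(3, 3), index=['b', 'd', 'c'],
                     columns=['C', 'A', 'E'])

plain = left + right                     # NaN wherever a label is missing on either side
filled = left.add(right, fill_value=0)   # a value missing on one side is treated as 0

print(plain)
print(filled)
```

Cells that are missing on both sides, such as row 'a' and column 'E', still come out as NaN.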

* Some example ufunc

In [None]:
np.square(df1)  # square the DataFrame

In [None]:
np.isnan(df1 - df2)  # element-wise test: True where the result is NaN

In [None]:
np.floor(df1 / s1)

### 5- Computing descriptive statistics <a id="IV-5"></a>

> Pandas also takes advantage of NumPy to provide a set of statistical methods with built-in handling for missing data. These can be categorized into:
    - Reduction statistics
    - Summary statistics
    
    NaN values are excluded by default

In [None]:
df3 = pd.DataFrame({'one':{'a': 1.4, 'b': 7.1, 'c': np.nan, 'd': 0.75},
                   'two':{'a': np.nan, 'b': -4.5, 'c': np.nan, 'd': -1.3}})

**_1-Reduction statistics_**: these extract a single value from a Series, or a Series of values from a DataFrame

In [None]:
df3

* Extrema reductions

In [None]:
df3.min()  # return a Series containing columns min

In [None]:
df3.idxmin()  # return indexes of min values

In [None]:
df3.max(axis=1)  # return a Series containing rows max

In [None]:
df3.idxmax(axis='columns')  # return indexes of max values

* Statistical reductions: these are the most commonly used ones, but more exist

In [None]:
df3.mean()  # return a Series containing columns mean

In [None]:
df3.median()  # return a Series containing columns median

In [None]:
df3.std()  # return a Series containing columns standard deviation

In [None]:
df3.count()  # return a Series containing columns number of non-NaN values

In [None]:
df3.sum()  # return a Series containing columns sum

In [None]:
df3.cumsum()  # return a DataFrame of accumulations

**_2-Summary statistics_**: which produce multiple summary statistics in one shot

In [None]:
df3

In [None]:
df3.describe()
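
The default exclusion of NaN can be switched off with the **skipna** argument of the reduction methods. A short sketch on the same data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': {'a': 1.4, 'b': 7.1, 'c': np.nan, 'd': 0.75},
                      'two': {'a': np.nan, 'b': -4.5, 'c': np.nan, 'd': -1.3}})

default_mean = frame['one'].mean()              # NaN skipped: (1.4 + 7.1 + 0.75) / 3
strict_mean = frame['one'].mean(skipna=False)   # any NaN makes the result NaN

print(default_mean, strict_mean)
```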

### 6- Function application and mapping <a id="IV-6"></a>

> A very useful operation is to apply a custom/anonymous function to a Series or DataFrame. This can be done element-wise or column-wise using one of the following methods:
    - map() applies a function, element-wise, to a Series
    - apply() applies a function, column-wise, to a DataFrame
    - applymap() applies a function, element-wise, to a DataFrame (since pandas 2.1 this method is deprecated in favour of the equivalent DataFrame.map())

**_1-map()_**: apply a function that adds 1, element-wise, on a Series

In [None]:
s1

In [None]:
s1.map(lambda x: x + 1)

**_2-apply()_**: apply a function that subtracts the min from the max, column-wise, on a DataFrame

In [None]:
df1

In [None]:
df1.apply(lambda x: x.max() - x.min())

This can also be done row-wise for DataFrames by setting the **axis** argument to **'columns'** (or 1)

In [None]:
df1.apply(lambda x: x.max() - x.min(), axis='columns')
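
The function passed to **apply()** does not have to return a single value. As a sketch, the hypothetical helper `min_max` below returns a small Series per column, and the results are assembled into a DataFrame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9.).reshape(3, 3), index=['d', 'a', 'b'],
                     columns=['A', 'B', 'C'])

def min_max(column):
    # Return two labelled values for each column
    return pd.Series([column.min(), column.max()], index=['min', 'max'])

summary = frame.apply(min_max)
print(summary)
```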

**_3-applymap()_**: apply a function that subtracts the first column's mean, element-wise, on a DataFrame

In [None]:
mean = df1['A'].mean()
df1.applymap(lambda x: x - mean)

***
## V- Working with missing data <a id="V"></a>

> There are plenty of methods in Pandas for dealing with missing data. Here are a few common ones

> ### [1- Filling missing values: fillna](#V-1)
> ### [2- Dropping axis with missing data: dropna](#V-2)
> ### [3- Interpolation at missing data: interpolate](#V-3)

In [None]:
df4 = pd.DataFrame({'one':{'a': 1.4, 'b': 7.1, 'c': np.nan, 'd': 0.75},
                   'two':{'a': np.nan, 'b': -4.5, 'c': np.nan, 'd': -1.3},
                   'three':{'a': np.nan, 'b': np.nan, 'c': 0.57, 'd': -0.44},
                   'four':{'a': 3, 'b': 2.7, 'c': 0.63, 'd': 2.34}})

### 1- Filling missing values: fillna  <a id="V-1"></a>

> The method **fillna()** can be used to fill in missing values with either:
    - Scalar
    - Forward or backward value
    - Pandas object

In [None]:
df4

**_1- Scalar filling_**: any value can be used to replace NaN

In [None]:
df4.fillna(0)

In [None]:
df4.fillna('NULL')

This can be done in most functions that might result in NaN by using the **fill_value** argument. For example, reindex(fill_value=0)
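
For instance, a minimal sketch of **reindex()** with and without **fill_value**, on a small made-up Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=['a', 'c'])

with_nan = s.reindex(['a', 'b', 'c'])                  # the new label 'b' becomes NaN
with_zero = s.reindex(['a', 'b', 'c'], fill_value=0)   # the new label 'b' becomes 0

print(with_nan)
print(with_zero)
```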

**_2- Forward or backward filling_**: values before or after a NaN can be used to replace it. Historically this was done through the **method** argument of **fillna()**; that argument is deprecated since pandas 2.1, and the dedicated **ffill()** and **bfill()** methods are preferred

For forward filling, use **ffill()** (the older spelling is fillna(method='ffill') or fillna(method='pad'))

In [None]:
df4.ffill()  # older equivalent: df4.fillna(method='ffill')

For backward filling, use **bfill()** (the older spelling is fillna(method='bfill') or fillna(method='backfill'))

In [None]:
df4.bfill()  # older equivalent: df4.fillna(method='bfill')

Some other methods that can introduce NaN also have a **method** argument. For example, reindex(method='ffill')

**_3- Pandas object filling_**: usually using a Series. The index of the Series must match the columns of the DataFrame we wish to fill

For example, we can pass the DataFrame means as a Series to the **fillna** method. This will replace NaN in each column with the mean of that column (or median(), max(), min(), std(), etc)

In [None]:
df4

In [None]:
df4.fillna(df4.mean())
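
**fillna()** also accepts a plain dict that maps column names to fill values, which is handy when each column needs a different rule. A sketch on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.4, np.nan, 0.75],
                      'two': [np.nan, -4.5, np.nan]})

# Column 'one' is filled with 0.0, column 'two' with its own median
filled = frame.fillna({'one': 0.0, 'two': frame['two'].median()})
print(filled)
```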

### 2- Dropping axis with missing data: dropna  <a id="V-2"></a>

> The method **dropna()** can be used to remove rows/columns which contain missing data

In [None]:
df4

Drop all rows that contain NaN

In [None]:
df4.dropna(axis=0)  # equivalent to df4.dropna()

Drop all columns that contain NaN

In [None]:
df4.dropna(axis=1)
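
Dropping everything that contains a single NaN is often too aggressive. As a sketch, the **how** and **thresh** arguments of **dropna()** give finer control (the DataFrame below is made up for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.0, np.nan, np.nan],
                      'two': [2.0, 5.0, np.nan],
                      'three': [3.0, np.nan, np.nan]})

any_nan = frame.dropna()            # default how='any': drop rows with at least one NaN
all_nan = frame.dropna(how='all')   # drop only the rows that are entirely NaN
two_plus = frame.dropna(thresh=2)   # keep rows with at least 2 non-NaN values

print(any_nan.index.tolist(), all_nan.index.tolist(), two_plus.index.tolist())
```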

### 3- Interpolation at missing data: interpolate  <a id="V-3"></a>

> Calling **interpolate()** on a Pandas object, will, by default, perform linear interpolation at missing datapoints

In [None]:
df4

In [None]:
df4.interpolate()
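
To see what linear interpolation does, it helps to try it on a Series where the answer is obvious. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# The two gaps are filled with evenly spaced values between 1.0 and 4.0
filled = s.interpolate()
print(filled)
```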

***
## VI- Reading and writing data in text format <a id="VI"></a>

> ### [1- Reading data](#VI-1)
> ### [2- Writing data](#VI-2)

### 1- Reading data <a id="VI-1"></a>

> A large number of functions for reading tabular data **as a DataFrame** exist in pandas. The most common ones are: 
    - read_csv() for reading delimited data from a csv file, 
    - read_excel() for reading tabular data from an excel file
    - read_table() for reading delimited data from a file, URL, or a file-like object
    
* Most of these functions have a large number of optional arguments. We will look into a few of them

Let's start with a small csv file ex1.csv, which has a header row (the first row in the file)  
> To view a csv file we can use the **cat** command on Linux or the **type** command on Windows

In [None]:
! cat Examples/ex1.csv  # for windows: ! type Examples\\ex1.csv

The simplest way to read the file is to pass only its path

In [None]:
df = pd.read_csv(r'Examples/ex1.csv')  # for windows: r'Examples\ex1.csv'
df

By default, the function assigns indexes and reads the first row in the file as the column names  
However, some files don't have a header row such as ex2.csv

In [None]:
! cat Examples/ex2.csv

In this case we can either: 
- Let Pandas assign default column names by setting the **header** argument to None
- Or specify column names ourselves by using the **names** argument

In [None]:
import pandas as pd
df = pd.read_csv(r'Examples/ex2.csv', header=None)
df

In [None]:
df = pd.read_csv(r'Examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
df

Instead of the default assigned indexes, we can specify one of the columns to be the index (by name or by integer) using the **index_col** argument  
In ex2.csv, we want the 'message' column to be the index

In [None]:
df = pd.read_csv(r'Examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')  # or index_col=4
df
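
If the example files are not at hand, the same arguments can be tried on an in-memory text buffer, since **read_csv()** also accepts file-like objects. This is a sketch with made-up contents standing in for a header-less file:

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory instead of on disk
raw = io.StringIO("1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n")

frame = pd.read_csv(raw, names=['a', 'b', 'c', 'd', 'message'], index_col='message')
print(frame)
```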

We can skip a list of lines when reading the file using the **skiprows** argument  
In ex4.csv, we want to skip the first, third, and fourth lines

In [None]:
!cat Examples/ex4.csv

In [None]:
df = pd.read_csv(r'Examples/ex4.csv', skiprows=[0, 2, 3])
df

Pandas by default replaces missing data with NaN  
ex5.csv has one empty string, and one NA value which Pandas replaces with NaN

In [None]:
!cat Examples/ex5.csv  # for windows: ! type Examples\\ex5.csv

In [None]:
df = pd.read_csv('Examples/ex5.csv')
df

However, we can specify additional strings that should be treated as missing using the **na_values** argument

In [None]:
df = pd.read_csv('Examples/ex5.csv', na_values=['NULL', 'foo'])
df
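
The strings passed to **na_values** are added to the default markers (such as NA or an empty field); they do not replace them. A small sketch on made-up in-memory data:

```python
import io
import pandas as pd

text = "name,score\nann,3\nbob,NA\ncat,missing\n"

default = pd.read_csv(io.StringIO(text))                            # only 'NA' is treated as missing
custom = pd.read_csv(io.StringIO(text), na_values=['missing'])      # 'NA' and 'missing' are both NaN

print(default)
print(custom)
```

With the custom marker in place, the score column no longer contains any text, so it can be parsed as numbers.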

If we want to read only a specific number of rows, we can use the **nrows** argument  
In ex6.csv, we want to only read the first 5 rows

In [None]:
df = pd.read_csv('Examples/ex6.csv', nrows=5)
df

### 2- Writing data <a id="VI-2"></a>

> Data can be written to a large number of delimited formats. The most common one is the csv file format

* Let's read some data from ex6.csv then write it into a csv file

In [None]:
data = pd.read_csv('Examples/ex6.csv', nrows=5)

The simplest way to write to a file is to pass only the output path

In [None]:
data.to_csv('Examples/out.csv')
!cat Examples/out.csv

By default, the delimiter is a comma, however, we can use other delimiters using the **sep** argument

In [None]:
data.to_csv('Examples/out.csv', sep='|')
!cat Examples/out.csv

By default, both the index and column names are written. Both of these can be disabled using the **index** and **header** arguments

In [None]:
data.to_csv('Examples/out.csv', index=False, header=False)
!cat Examples/out.csv

We can also write only a subset of the columns, and in an order we want

In [None]:
data.to_csv('Examples/out.csv', index=False, columns=['key', 'two', 'four'])
!cat Examples/out.csv
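
When **to_csv()** is called without a path it returns the text instead of writing a file, which makes it easy to inspect the effect of each argument. A sketch on a small made-up DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({'key': ['x', 'y'], 'two': [1, 2]}, index=['r1', 'r2'])

full = frame.to_csv()                                      # index and header included
bare = frame.to_csv(index=False, header=False, sep='|')    # values only, '|' as delimiter

print(full)
print(bare)
```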

***
## VII- Visualization <a id="VII"></a>

> Series and DataFrame have a **plot()** method for making some basic plot types. By default, **plot()** makes line plots

In [None]:
import matplotlib
matplotlib.use('nbagg')

In [None]:
import matplotlib.pyplot as plt

In [None]:
df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
                  columns=['A', 'B', 'C', 'D'],
                  index=np.arange(0, 100, 10))
df.plot()
plt.show()

> A handful of plotting styles can be used by providing the **kind** argument to **plot()**. Some include:

    - 'bar' or 'barh' for bar plots
    - 'hist' for histogram
    - 'box' for boxplot
    - 'scatter' for scatter plots
    - 'hexbin' for hexagonal bin plots
    - 'pie' for pie plots

* For example, to plot a bar chart

In [None]:
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

df.plot(kind='bar')
plt.show()

Similar to Matplotlib, we can explicitly create and keep track of the Axes objects

In [None]:
ax = df.plot(kind='bar')
ax.set_title('Axes title')
ax.set_ylabel('Y-Axis label')
ax.set_xlabel('X-Axis label')
plt.show()
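
Some plot kinds need to be told which columns to use. As a sketch, a scatter plot takes the column names for the two axes through the **x** and **y** arguments (the data here is random and only illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])

# plot() returns the Matplotlib Axes, which can be customized further
ax = frame.plot(kind='scatter', x='a', y='b')
ax.set_title('a against b')
```

In a notebook the figure is shown by the active backend; in a plain script, a call to plt.show() would be needed after importing matplotlib.pyplot as plt.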

#### - Exercises <a id='exercises'></a>
> Modified from DataCamp.com

I- In this exercise you will be working with vehicle data from different countries. Three lists are defined for you:
    - names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']  
    - names: containing the country names for which the data is available  
    - dr =  [True, False, False, False, True, True, True]  
    - dr: a list of booleans that tells whether people drive on the right (True) or on the left (False) in the corresponding country  
    - cpc = [809, 731, 588, 18, 200, 70, 45]  
    - cpc: the number of motor vehicles per 1000 people in the corresponding country

1. Create a dictionary my_dict with three key:value pairs:
    - key 'country' and value names
    - key 'drives_right' and value dr
    - key 'cars_per_cap' and value cpc

2. Build a DataFrame cars from my_dict

3. Check the index and columns of the DataFrame

4. Override the default indexes with this list of indexes and recheck the DataFrame new index
    - row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

II- In the Examples folder you have cars.csv

1. Read this file as a DataFrame, and store it as cars  

2. Check the index and columns of the DataFrame. Does it look correct?  

3. Read the file again, but this time specify the first column in the file as an index for the DataFrame. Check the index now  

4. Select the 'country' column of cars as a Pandas Series. Print result

5. Select the 'country' column of cars as a Pandas DataFrame. Print result

6. Select both the 'country' and 'drives_right' columns of cars. Print result

7. Select the first 3 rows from cars. Print result

8. Select the fourth, fifth and sixth rows (corresponding to indexes 3, 4 and 5). Print result

9. Select the row corresponding to Japan as a Series (its index is 'JAP'). Print result 

10. Select the observations for Australia and Egypt as a DataFrame (their indexes are 'AUS', 'EG'). Print result

11. Select the 'drives_right' value of the row corresponding to Morocco (its index is 'MOR')

12. Select a sub-DataFrame, containing the rows for Russia and Morocco and the columns 'country' and 'drives_right' (Russia index is 'RU')

#### - Solutions <a id='solutions'></a>

Question I

In [None]:
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

In [None]:
# 1- create dictionary my_dict
my_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}

In [None]:
# 2- build a DataFrame cars
cars = pd.DataFrame(my_dict)

In [None]:
# 3- check index and columns
print(cars)
print(cars.index)
print(cars.columns)

In [None]:
# 4- override default indexes and recheck
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels
print(cars.index)
print(cars)

Question II

In [None]:
# 1- read the file into a DataFrame
cars = pd.read_csv(r'Examples/cars.csv')

In [None]:
# 2- check index and columns 
print(cars.index)
print(cars.columns)

In [None]:
# 3- read again with the first column as the index 
cars = pd.read_csv(r'Examples/cars.csv', index_col = 0)
print(cars.index)
print(cars.columns)

In [None]:
# 4- select the 'country' column as Series
print(cars['country'])

In [None]:
# 5- select the 'country' column as a DataFrame
print(cars[['country']])

In [None]:
# 6- select both 'country' and 'drives_right' columns
print(cars[['country', 'drives_right']])

In [None]:
# 7- select the first 3 rows
print(cars.iloc[[0, 1, 2]])

In [None]:
# 8- select the fourth, fifth and sixth rows
print(cars.iloc[3:6])

In [None]:
# 9- select the row corresponding to Japan as a Series (a single label, not a list, returns a Series)
print(cars.loc['JAP'])

In [None]:
# 10- select the rows for Australia and Egypt as a DataFrame
print(cars.loc[['AUS', 'EG']])

In [None]:
# 11- select the 'drives_right' value of the row corresponding to Morocco (single labels return the value itself)
print(cars.loc['MOR', 'drives_right'])

In [None]:
# 12- select a DataFrame containing the rows for Russia and Morocco and the columns 'country' and 'drives_right'
print(cars.loc[['RU', 'MOR'],['country', 'drives_right']])