# More Pandas
This notebook supplements pandas_tutorial.ipynb.

In [26]:
import pandas as pd

## Creating a Pandas DataFrame

### Different ways of specifying the filename
In the `read_csv` example in pandas_tutorial.ipynb, we used forward slashes in the filename: `pd.read_csv('./data/students.csv')`.
The following examples also load our dataset.

In [27]:
# Notice the double backslashes. They are necessary because a single backslash starts an
# escape sequence, such as \t for a tab character or \n for a newline character.
# Backslash path separators work on Windows but not on Linux or macOS.
students_df = pd.read_csv('.\\data\\students.csv')

students_df

Unnamed: 0,Student ID,Name,Age,Subject,Year of Study,Country of Origin
0,2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
1,a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
2,d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
3,3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau
4,67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria
...,...,...,...,...,...,...
96,a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands
97,3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein
98,716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros
99,34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands


In [28]:
# Here we're using a 'raw' string (notice the r). This removes the need for the double backslash.
students_df = pd.read_csv(r'.\data\students.csv')

students_df

Unnamed: 0,Student ID,Name,Age,Subject,Year of Study,Country of Origin
0,2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
1,a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
2,d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
3,3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau
4,67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria
...,...,...,...,...,...,...
96,a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands
97,3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein
98,716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros
99,34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands
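
Because backslash paths only work on Windows, a portable option is to build the path with `pathlib.Path`, which `read_csv` also accepts. The sketch below is an illustration rather than part of the original tutorial: the temporary folder, the one-row CSV, and the name `portable_df` are made up so that the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway data folder so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'data' / 'students.csv'
    csv_path.parent.mkdir()
    csv_path.write_text('Student ID,Name\n2703f3f0,Mr Clifford Watson\n')

    # Path objects use the correct separator for the current operating system.
    portable_df = pd.read_csv(csv_path)

print(portable_df)
```

With the real dataset, the equivalent call would be `pd.read_csv(Path('data') / 'students.csv')`.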


### Creating a DataFrame from a Dictionary
You can create a DataFrame by directly passing a dictionary of lists to the pd.DataFrame() constructor, where the keys become column names and the lists become the data.

In [29]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Grade': ['A', 'B', 'C']
}

students_df = pd.DataFrame(data)

students_df

Unnamed: 0,Name,Age,Grade
0,Alice,25,A
1,Bob,30,B
2,Charlie,35,C


### Creating a DataFrame from Lists of Lists
Another method is by using a list of lists, each representing a row of data. You'll need to specify the column names separately.

In [30]:
data = [
    ['Alice', 25, 'A'],
    ['Bob', 30, 'B'],
    ['Charlie', 35, 'C']
]

columns = ['Name', 'Age', 'Grade']

students_df = pd.DataFrame(data, columns=columns)

students_df

Unnamed: 0,Name,Age,Grade
0,Alice,25,A
1,Bob,30,B
2,Charlie,35,C


### Creating a DataFrame from a List of Dictionaries
Each dictionary in the list represents a row in the DataFrame, with keys as column names and values as the data for those columns.

In [31]:
data = [
    {'Name': 'Alice', 'Age': 25, 'Grade': 'A'},
    {'Name': 'Bob', 'Age': 30, 'Grade': 'B'},
    {'Name': 'Charlie', 'Age': 35, 'Grade': 'C'}
]

students_df = pd.DataFrame(data)

students_df

Unnamed: 0,Name,Age,Grade
0,Alice,25,A
1,Bob,30,B
2,Charlie,35,C
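
As a quick check, the three constructor styles above should produce identical DataFrames. The sketch below compares them with `DataFrame.equals`; the variable names are chosen for illustration only. It also shows what happens when a dictionary in the list lacks a key.

```python
import pandas as pd

from_dict_of_lists = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Grade': ['A', 'B', 'C'],
})

from_list_of_lists = pd.DataFrame(
    [['Alice', 25, 'A'], ['Bob', 30, 'B'], ['Charlie', 35, 'C']],
    columns=['Name', 'Age', 'Grade'],
)

from_list_of_dicts = pd.DataFrame([
    {'Name': 'Alice', 'Age': 25, 'Grade': 'A'},
    {'Name': 'Bob', 'Age': 30, 'Grade': 'B'},
    {'Name': 'Charlie', 'Age': 35, 'Grade': 'C'},
])

# equals() compares the values, the dtypes, and the row and column labels.
print(from_dict_of_lists.equals(from_list_of_lists))
print(from_dict_of_lists.equals(from_list_of_dicts))

# A dictionary that lacks a key leaves a missing value (NaN) in that column.
partial_df = pd.DataFrame([{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob'}])
print(partial_df)
```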


### Creating a DataFrame from JSON
You can load a DataFrame from a JSON string directly using `pd.read_json()`.

In [32]:
json_data = '''
[
    {"Name": "Alice", "Age": 25, "Grade": "A"},
    {"Name": "Bob", "Age": 30, "Grade": "B"},
    {"Name": "Charlie", "Age": 35, "Grade": "C"}
]
'''

students_df = pd.read_json(json_data)

students_df


Unnamed: 0,Name,Age,Grade
0,Alice,25,A
1,Bob,30,B
2,Charlie,35,C


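Recent versions of pandas emit a `FutureWarning` when a literal JSON string is passed to `read_json`. Here we 'future proof' the JSON example by wrapping the string in `StringIO`: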

In [33]:
from io import StringIO

json_data = '''
[
    {"Name": "Alice", "Age": 25, "Grade": "A"},
    {"Name": "Bob", "Age": 30, "Grade": "B"},
    {"Name": "Charlie", "Age": 35, "Grade": "C"}
]
'''

students_df = pd.read_json(StringIO(json_data))

students_df

Unnamed: 0,Name,Age,Grade
0,Alice,25,A
1,Bob,30,B
2,Charlie,35,C
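
Another option, shown here as a sketch rather than as the tutorial's recommended approach, is to parse the string with the standard-library `json` module and pass the resulting list of dictionaries to `pd.DataFrame()`. The names `records` and `students_from_json` are illustrative.

```python
import json

import pandas as pd

json_data = '''
[
    {"Name": "Alice", "Age": 25, "Grade": "A"},
    {"Name": "Bob", "Age": 30, "Grade": "B"}
]
'''

# json.loads turns the JSON array into a Python list of dictionaries.
records = json.loads(json_data)

# A list of dictionaries can be passed straight to the DataFrame constructor.
students_from_json = pd.DataFrame(records)

print(students_from_json)
```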


### Creating a DataFrame from a CSV File (with different options)
Besides the basic CSV loading, you can specify additional parameters to handle different data formats and situations.

In [34]:
# Specifying delimiters: here we're explicitly stating that the delimiter is a comma.
# This is the default, but some files that we want to load might use tabs, semicolons, etc.
students_df = pd.read_csv('./data/students.csv',
                          delimiter=',')

students_df

Unnamed: 0,Student ID,Name,Age,Subject,Year of Study,Country of Origin
0,2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
1,a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
2,d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
3,3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau
4,67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria
...,...,...,...,...,...,...
96,a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands
97,3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein
98,716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros
99,34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands
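
To see the `delimiter` parameter make a difference, the sketch below reads semicolon-separated and tab-separated text. The data is made up and held in memory with `StringIO`, so no extra files are needed.

```python
from io import StringIO

import pandas as pd

semicolon_text = 'Name;Age;Grade\nAlice;25;A\nBob;30;B\n'
tab_text = 'Name\tAge\tGrade\nAlice\t25\tA\nBob\t30\tB\n'

# The delimiter must match the character that separates the values.
semicolon_df = pd.read_csv(StringIO(semicolon_text), delimiter=';')
tab_df = pd.read_csv(StringIO(tab_text), delimiter='\t')

print(semicolon_df)
print(semicolon_df.equals(tab_df))
```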


In the example below, we assume that the first row of the file contains data values rather than headings. This doesn't apply to our original students.csv file, but it could well apply to other files that we want to load.

In [35]:
# Loading a file without a header row, and specifying column names manually
students_df = pd.read_csv('./data/students_no_headings.csv',
                          header=None,
                          names=['Student ID', 'Name', 'Age', 'Subject',
                                 'Year of Study', 'Country of Origin'])

students_df

Unnamed: 0,Student ID,Name,Age,Subject,Year of Study,Country of Origin
0,2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
1,a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
2,d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
3,3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau
4,67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria
...,...,...,...,...,...,...
96,a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands
97,3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein
98,716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros
99,34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands
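
The sketch below shows why `header=None` matters. It uses two made-up rows held in memory instead of students_no_headings.csv, and the names `wrong_df` and `right_df` are illustrative.

```python
from io import StringIO

import pandas as pd

no_header_text = '2703f3f0,Mr Clifford Watson,25.0\na8040287,Elliott Ward,25.0\n'
column_names = ['Student ID', 'Name', 'Age']

# Without header=None, the first data row is consumed as the column names.
wrong_df = pd.read_csv(StringIO(no_header_text))
print(len(wrong_df), list(wrong_df.columns))

# With header=None and names, every row is kept as data.
right_df = pd.read_csv(StringIO(no_header_text), header=None, names=column_names)
print(len(right_df), list(right_df.columns))
```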


### Creating a DataFrame from a SQL Database
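We can also fetch data from a SQL database directly into a DataFrame. The example below connects to a SQLite database using Python's built-in `sqlite3` module. For other databases, `read_sql_query` expects a SQLAlchemy connectable; SQLite is the only database for which pandas supports a raw DBAPI connection.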

For info, I created this SQLite database by saving the Pandas DataFrame using `to_sql`:
- `students_df.to_sql(name='students', con=connection, index=False)`

In [36]:
import sqlite3
connection = sqlite3.connect('./data/students.db')
students_df = pd.read_sql_query("SELECT * from students", connection)
connection.close()

students_df

Unnamed: 0,Student ID,Name,Age,Subject,Year of Study,Country of Origin
0,2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
1,a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
2,d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
3,3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau
4,67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria
...,...,...,...,...,...,...
96,a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands
97,3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein
98,716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros
99,34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands
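
The round trip can be tried without students.db by using an in-memory SQLite database. This is a sketch under stated assumptions: the two-row `source_df` is made up, and the `?` placeholder with `params` is one way to filter rows, not something the original notebook does.

```python
import sqlite3
from contextlib import closing

import pandas as pd

source_df = pd.DataFrame({
    'Name': ['Elliott Ward', 'Mr Dominic Mason'],
    'Subject': ['Computer Science', 'Physics'],
})

# ':memory:' creates a temporary database that disappears when it is closed.
with closing(sqlite3.connect(':memory:')) as connection:
    source_df.to_sql(name='students', con=connection, index=False)

    # The ? placeholder is filled from params, so the value is never
    # pasted into the SQL text by hand.
    physics_df = pd.read_sql_query(
        'SELECT * FROM students WHERE Subject = ?',
        connection,
        params=('Physics',),
    )

print(physics_df)
```

Using `closing` ensures the connection is closed even if the query raises an error.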


## Using the index column

We saw in pandas_tutorial.ipynb that we could use `index_col` to set the index column when loading a DataFrame. We can also set it after the DataFrame has been loaded using `set_index`.

In [37]:
students_df = students_df.set_index('Student ID')

students_df

Student ID,Name,Age,Subject,Year of Study,Country of Origin
2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau
67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria
...,...,...,...,...,...
a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands
3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein
716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros
34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands


We can use the index to access a particular record using `loc` (which stands for location).

In [38]:
students_df.loc['2703f3f0']

Name                 Mr Clifford Watson
Age                                25.0
Subject              English Literature
Year of Study                       1.0
Country of Origin      Saint Barthelemy
Name: 2703f3f0, dtype: object

Ideally the index column contains only unique values. Our dataset includes a duplicated row, however, and `loc` returns both copies when we use that row's index value.

In [39]:
students_df.loc['34b97db2']

Student ID,Name,Age,Subject,Year of Study,Country of Origin
34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands
34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands
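
Duplicate index values can be detected, and removed if desired, before relying on `loc`. The sketch below uses a small made-up frame called `demo_df`. Keeping the first occurrence is an assumption; the right choice depends on the data.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'Name': ['Elliott Ward', 'Miss Lydia Saunders', 'Miss Lydia Saunders']},
    index=pd.Index(['a8040287', '34b97db2', '34b97db2'], name='Student ID'),
)

# is_unique reports whether every index label appears exactly once.
print(demo_df.index.is_unique)

# duplicated() marks the second and later occurrences of each label.
duplicate_mask = demo_df.index.duplicated(keep='first')
deduplicated_df = demo_df[~duplicate_mask]

print(deduplicated_df.index.is_unique)
print(deduplicated_df)
```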


If we want to select more than one record, we can use `loc` in combination with `index` and `isin`.

In [40]:
selected_ids = ['2703f3f0', 'a8040287']
students_df.loc[students_df.index.isin(selected_ids)]

Student ID,Name,Age,Subject,Year of Study,Country of Origin
2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
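
A list of labels can also be passed directly to `loc`. The sketch below, using a made-up `demo_df`, contrasts the two approaches: `isin` silently skips IDs that are not in the index, whereas `loc` with a list raises a `KeyError`. The ID `'not-a-real-id'` is invented for the demonstration.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'Name': ['Mr Clifford Watson', 'Elliott Ward', 'Miss Pauline Dunn']},
    index=pd.Index(['2703f3f0', 'a8040287', 'd8da5486'], name='Student ID'),
)

# Passing a list of labels to loc returns the rows in the order requested.
print(demo_df.loc[['a8040287', '2703f3f0']])

ids_with_unknown = ['2703f3f0', 'not-a-real-id']

# isin ignores labels that are missing from the index.
print(demo_df.loc[demo_df.index.isin(ids_with_unknown)])

# loc with a list is strict: a missing label raises KeyError.
try:
    demo_df.loc[ids_with_unknown]
except KeyError as error:
    print('KeyError:', error)
```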


Let's use `loc` to examine rows that meet specified criteria.

In [41]:
students_df.loc[(students_df['Age'] == 22) & (students_df['Subject']=='Engineering')]

Student ID,Name,Age,Subject,Year of Study,Country of Origin
d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
0723a275,Suzanne Dickinson,22.0,Engineering,4.0,Uruguay
a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands
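
The same kind of filter can be written with `DataFrame.query`, which some people find easier to read. This is an optional alternative sketched on a made-up `demo_df`, not a replacement for the boolean-mask approach above.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Name': ['Miss Pauline Dunn', 'Mr Dominic Mason', 'Kelly Foster'],
    'Age': [22.0, 22.0, 22.0],
    'Subject': ['Engineering', 'Physics', 'Engineering'],
})

# Boolean mask: each condition needs its own parentheses, combined with &.
mask_result = demo_df.loc[(demo_df['Age'] == 22) & (demo_df['Subject'] == 'Engineering')]

# query: the same conditions written as a single string expression.
query_result = demo_df.query("Age == 22 and Subject == 'Engineering'")

print(query_result)
print(mask_result.equals(query_result))
```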


We can also use `iloc` (integer location) to select rows.

In [42]:
students_df.iloc[0]

Name                 Mr Clifford Watson
Age                                25.0
Subject              English Literature
Year of Study                       1.0
Country of Origin      Saint Barthelemy
Name: 2703f3f0, dtype: object

Here we get the first three rows. Notice that we can use the same slicing syntax that we use for lists.

In [43]:
students_df.iloc[0:3]

Student ID,Name,Age,Subject,Year of Study,Country of Origin
2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
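
One difference worth knowing: an `iloc` slice excludes its end position, like a list slice, whereas a `loc` slice over labels includes the end label. The sketch below illustrates this with a made-up four-row `demo_df`.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'Name': ['Mr Clifford Watson', 'Elliott Ward',
              'Miss Pauline Dunn', 'Mr Dominic Mason']},
    index=pd.Index(['2703f3f0', 'a8040287', 'd8da5486', '3ac1b74d'],
                   name='Student ID'),
)

# iloc: positions 0 and 1 (the end position 2 is excluded).
by_position = demo_df.iloc[0:2]

# loc: from the first label up to AND including the end label.
by_label = demo_df.loc['2703f3f0':'d8da5486']

print(len(by_position), len(by_label))
```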


## Joining DataFrames
Next we will load a related dataset containing some student grades. It uses the same Student IDs as the students dataset.

In [44]:
grades_df = pd.read_csv('./data/grades.csv', index_col='Student ID')

grades_df

Student ID,Grade
2703f3f0,A
a8040287,C
d8da5486,A
3ac1b74d,B
67850858,C
62dd3a69,B
6b22a999,A


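Now we `join` the two DataFrames where the Student IDs are equal. By default `join` performs a left join, so students without a matching row in `grades_df` are kept and their Grade is left empty (NaN).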

In [59]:
joined_df = students_df.join(grades_df, on='Student ID')

joined_df

Unnamed: 0,index,Student ID,Name,Age,Subject,Year of Study,Country of Origin,Grade
0,0,2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy,A
1,1,a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea,C
2,2,d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan,A
3,3,3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau,B
4,4,67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria,C
...,...,...,...,...,...,...,...,...
96,96,a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands,
97,97,3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein,
98,98,716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros,
99,99,34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands,
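
To keep only the students that have a grade, `join` accepts a `how` argument. The sketch below uses small made-up frames named `students` and `grades`, both indexed by Student ID, to compare the default left join with an inner join.

```python
import pandas as pd

students = pd.DataFrame(
    {'Name': ['Mr Clifford Watson', 'Elliott Ward', 'Kelly Foster']},
    index=pd.Index(['2703f3f0', 'a8040287', 'a8be1ec3'], name='Student ID'),
)
grades = pd.DataFrame(
    {'Grade': ['A', 'C']},
    index=pd.Index(['2703f3f0', 'a8040287'], name='Student ID'),
)

# Default (how='left'): every student is kept; a missing grade becomes NaN.
left_joined = students.join(grades)

# how='inner': only students that appear in both DataFrames are kept.
inner_joined = students.join(grades, how='inner')

print(left_joined)
print(inner_joined)
```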


Alternatively we could use `merge`. Its default is an inner join, so we pass `how='left'` to keep every student, as `join` did.

In [61]:
joined_df = pd.merge(students_df, grades_df, on='Student ID', how='left')

joined_df

Unnamed: 0,index,Student ID,Name,Age,Subject,Year of Study,Country of Origin,Grade
0,0,2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy,A
1,1,a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea,C
2,2,d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan,A
3,3,3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau,B
4,4,67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria,C
...,...,...,...,...,...,...,...,...
96,96,a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands,
97,97,3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein,
98,98,716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros,
99,99,34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands,


For more information on `merge()`, `join()`, and `concat()`, see https://realpython.com/pandas-merge-join-and-concat/

## Resetting the index
If we need to use the index as a regular column, `reset_index` moves it back into the columns and restores the default integer index.

In [56]:
students_df = students_df.reset_index()

students_df

Unnamed: 0,index,Student ID,Name,Age,Subject,Year of Study,Country of Origin
0,0,2703f3f0,Mr Clifford Watson,25.0,English Literature,1.0,Saint Barthelemy
1,1,a8040287,Elliott Ward,25.0,Computer Science,4.0,Guinea
2,2,d8da5486,Miss Pauline Dunn,22.0,Engineering,4.0,Afghanistan
3,3,3ac1b74d,Mr Dominic Mason,22.0,Physics,1.0,Palau
4,4,67850858,Mrs Melanie Brown,18.0,English Literature,3.0,Algeria
...,...,...,...,...,...,...,...
96,96,a8be1ec3,Kelly Foster,22.0,Engineering,1.0,Netherlands
97,97,3b69ff22,Sara Austin,19.0,Computer Science,34.0,Liechtenstein
98,98,716fb45f,Miss Grace Miller,22.0,English Literature,4.0,Comoros
99,99,34b97db2,Miss Lydia Saunders,23.0,Physics,2.0,Faroe Islands
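
The sketch below, on made-up data, shows three behaviours of `reset_index`: a named index becomes a column with that name, `drop=True` discards the index instead of keeping it, and an unnamed default index becomes a column called `index`.

```python
import pandas as pd

indexed_df = pd.DataFrame(
    {'Name': ['Mr Clifford Watson', 'Elliott Ward']},
    index=pd.Index(['2703f3f0', 'a8040287'], name='Student ID'),
)

# The named index is moved back into a 'Student ID' column.
restored_df = indexed_df.reset_index()
print(list(restored_df.columns))

# drop=True throws the old index away rather than keeping it as a column.
dropped_df = indexed_df.reset_index(drop=True)
print(list(dropped_df.columns))

# Resetting an unnamed default index adds a column called 'index'.
unnamed_df = pd.DataFrame({'Name': ['Mr Clifford Watson']}).reset_index()
print(list(unnamed_df.columns))
```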
