In [1]:
import pandas as pd # A general purpose Python library for data analysis
import numpy as np # A library for scientific computing in Python (e.g., provides high-performance multi-dimensional array objects and operations)

import matplotlib.pyplot as plt # a plotting library for Python and NumPy (readily customizable)
import seaborn as sns # Another plotting library for Python (less syntax, excellent default themes; it uses matplotlib behind the scenes)
import time

## Knowledge Streams 2024

In this notebook, we will learn about the key data structures provided by the Pandas library: **Data Frames, Series, and Indices**.

In addition, we will learn about the following operations:
* How to access data contained in these structures?
* How to read files (e.g., csv, xlsx, sql) to create these structures?
* How to carry out different data manipulation tasks using these structures?

`Dataset`: US elections with information about candidates, their party, votes won, year of election and the result.

## Reading in Data Frames from Files
We'll be using **read_csv** today. Note that this file reading function does all the *data parsing* for you, which is very useful.

Before loading the file into a DataFrame, let's first take a look at the **elections.csv** file.

In [3]:
# Load the csv file and print its shape
# Code here
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/My Drive/elections.csv')
shape = df.shape
print(shape)

# How many observations (rows) and features (columns) are given?
# Code here
observations = shape[0]
features = shape[1]
print(f"There are {observations} observations and {features} features")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
(182, 6)
There are 182 observations and 6 features
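
To make the "data parsing" done by `read_csv` concrete, here is a minimal sketch that reads a small in-memory sample instead of the file on Drive. The two rows are copied from the output shown in this notebook; the names `csv_text` and `sample` are illustrative assumptions, not part of the original exercise.

```python
import io
import pandas as pd

# Two-row sample in the same layout as elections.csv (rows taken from the
# notebook output); csv_text and sample are illustrative names.
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred while parsing the text
```

The numeric columns come back as integers and floats rather than strings, which is the parsing work `read_csv` does for us.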


In [4]:
# We can use the **head** method to show only the first few rows of a DataFrame.
# Code here
print(df.head(10))

# Use the **tail** method to show the last few observations.
# code here
print(df.tail(10))

   Year               Candidate                  Party  Popular vote Result  \
0  1824          Andrew Jackson  Democratic-Republican        151271   loss   
1  1824       John Quincy Adams  Democratic-Republican        113142    win   
2  1828          Andrew Jackson             Democratic        642806    win   
3  1828       John Quincy Adams    National Republican        500897   loss   
4  1832          Andrew Jackson             Democratic        702735    win   
5  1832              Henry Clay    National Republican        484205   loss   
6  1832            William Wirt           Anti-Masonic        100715   loss   
7  1836       Hugh Lawson White                   Whig        146109   loss   
8  1836        Martin Van Buren             Democratic        763291    win   
9  1836  William Henry Harrison                   Whig        550816   loss   

           %  
0  57.210122  
1  42.789878  
2  56.203927  
3  43.796073  
4  54.574789  
5  37.603628  
6   7.821583  
7  10.0059

In [6]:
# The `read_csv` command lets us specify a **column to use as the index**. For example, we could have used __Year__ as the index.
#Code here
data = pd.read_csv('/content/drive/My Drive/elections.csv', index_col="Year")
print(data)

              Candidate                  Party  Popular vote Result          %
Year                                                                          
1824     Andrew Jackson  Democratic-Republican        151271   loss  57.210122
1824  John Quincy Adams  Democratic-Republican        113142    win  42.789878
1828     Andrew Jackson             Democratic        642806    win  56.203927
1828  John Quincy Adams    National Republican        500897   loss  43.796073
1832     Andrew Jackson             Democratic        702735    win  54.574789
...                 ...                    ...           ...    ...        ...
2016         Jill Stein                  Green       1457226   loss   1.073699
2020       Joseph Biden             Democratic      81268924    win  51.311515
2020       Donald Trump             Republican      74216154   loss  46.858542
2020       Jo Jorgensen            Libertarian       1865724   loss   1.177979
2020     Howard Hawkins                  Green      
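
As a small sketch of what `index_col` does, the snippet below reads a three-row in-memory sample (rows taken from the output above; `csv_text` and `indexed` are illustrative names). It also shows that index values, unlike what we usually expect from row identifiers, are allowed to repeat.

```python
import io
import pandas as pd

# Three rows copied from the notebook output; only three columns are kept
# to keep the sketch short.
csv_text = """Year,Candidate,Party
1824,Andrew Jackson,Democratic-Republican
1824,John Quincy Adams,Democratic-Republican
1828,Andrew Jackson,Democratic
"""

indexed = pd.read_csv(io.StringIO(csv_text), index_col="Year")

print(indexed.index.name)       # the index is now labelled "Year"
print(list(indexed.columns))    # "Year" is no longer a regular column
print(indexed.index.is_unique)  # False: 1824 appears twice
```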

In [9]:
# Alternatively, we could have used the **set_index** method on the DataFrame to set a particular column as the index.
# code here
data.reset_index(inplace=True)
data.set_index("Year")

Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...
2016,Jill Stein,Green,1457226,loss,1.073699
2020,Joseph Biden,Democratic,81268924,win,51.311515
2020,Donald Trump,Republican,74216154,loss,46.858542
2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


# Caution:
The **set_index** method (like most DataFrame methods) **does not modify the DataFrame** it is called on. It returns a new DataFrame, so the original stays untouched unless you assign the result back. Note: there is a flag called `inplace` which does modify the calling DataFrame (e.g., `elections.set_index("Party", inplace=True)`, where `elections` is the name of your DataFrame).
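
The caution can be checked with a minimal sketch. The tiny `elections` DataFrame here is built by hand from two rows of the output above and is only an illustration, not the full dataset.

```python
import pandas as pd

# Hand-built two-row stand-in for the elections data.
elections = pd.DataFrame({
    "Year": [1824, 1824],
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic-Republican", "Democratic-Republican"],
})

# set_index returns a new DataFrame; the caller is left unchanged.
by_year = elections.set_index("Year")
columns_before = list(elections.columns)
print(columns_before)      # "Year" is still an ordinary column
print(by_year.index.name)  # Year

# With inplace=True the calling DataFrame itself is modified.
elections.set_index("Party", inplace=True)
print(elections.index.name)  # Party
```

Assigning the result back (`elections = elections.set_index("Year")`) is a common alternative to `inplace=True`.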

## Duplicate Columns?
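By contrast with index values, which may repeat (as `Year` does above), column names should be unique. For example, if we read in a file whose column names are not unique, Pandas will automatically rename any duplicates (a second `name` column becomes `name.1`). Load duplicate_columns.csv to see this.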

In [12]:
#Answer Here
duplicate_file = pd.read_csv('/content/drive/My Drive/duplicate_columns.csv')
print(duplicate_file)

    name    name.1      flavor
0   john     smith     vanilla
1  zhang      shan   chocolate
2  fulan  alfulani  strawberry
3   hong   gildong      banana
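
The renaming can be reproduced without the file. This sketch assumes the header row of duplicate_columns.csv is `name,name,flavor` (inferred from the output above, not confirmed); the data rows are copied from that output.

```python
import io
import pandas as pd

# Assumed raw contents: the header repeats the column name "name".
csv_text = """name,name,flavor
john,smith,vanilla
zhang,shan,chocolate
"""

dup = pd.read_csv(io.StringIO(csv_text))

# read_csv appends ".1" to the second occurrence of a repeated header.
print(list(dup.columns))  # ['name', 'name.1', 'flavor']
```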


## The [ ] Operator & Indexing

The DataFrame class has an indexing operator **[ ]** (also known as the 'bracket' operator) that lets you do a variety of different things. If you provide a string to the **[ ]** operator, you get back a ***Series*** corresponding to the requested label.

1. Use **[ ]** to display different columns.

2. Use a list to retrieve multiple columns.
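
A minimal sketch of the two behaviours of **[ ]**, using a hand-built three-row stand-in for the elections data (the name `elections` and the reduced set of columns are assumptions for illustration):

```python
import pandas as pd

# Three rows copied from the notebook output.
elections = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
    "Result": ["loss", "win", "win"],
})

one_column = elections["Candidate"]             # a string selects one column
two_columns = elections[["Candidate", "Party"]]  # a list selects several

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```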

In [26]:
# Display and retrieve multiple columns from the elections DataFrame; each column is converted to a Python list.
#Code here
column_list1 = data['Candidate'].tolist()
column_list2 = data['Party'].tolist()
print(column_list1)
print(column_list2)

['Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson', 'Henry Clay', 'William Wirt', 'Hugh Lawson White', 'Martin Van Buren', 'William Henry Harrison', 'Martin Van Buren', 'William Henry Harrison', 'Henry Clay', 'James Polk', 'Lewis Cass', 'Martin Van Buren', 'Zachary Taylor', 'Franklin Pierce', 'John P. Hale', 'Winfield Scott', 'James Buchanan', 'John C. Frémont', 'Millard Fillmore', 'Abraham Lincoln', 'John Bell', 'John C. Breckinridge', 'Stephen A. Douglas', 'Abraham Lincoln', 'George B. McClellan', 'Horatio Seymour', 'Ulysses Grant', 'Horace Greeley', 'Ulysses Grant', 'Rutherford Hayes', 'Samuel J. Tilden', 'James B. Weaver', 'James Garfield', 'Winfield Scott Hancock', 'Benjamin Butler', 'Grover Cleveland', 'James G. Blaine', 'John St. John', 'Alson Streeter', 'Benjamin Harrison', 'Clinton B. Fisk', 'Grover Cleveland', 'Benjamin Harrison', 'Grover Cleveland', 'James B. Weaver', 'John Bidwell', 'John M. Palmer', 'Joshua Levering', 'William J

In [18]:
#The **[ ]** operator also accepts a list of strings. In this case, you get back a **DataFrame** corresponding to the requested strings.
# code here
data[['Candidate', 'Party']]

,Candidate,Party
0,Andrew Jackson,Democratic-Republican
1,John Quincy Adams,Democratic-Republican
2,Andrew Jackson,Democratic
3,John Quincy Adams,National Republican
4,Andrew Jackson,Democratic
5,Henry Clay,National Republican
6,William Wirt,Anti-Masonic
7,Hugh Lawson White,Whig
8,Martin Van Buren,Democratic
9,William Henry Harrison,Whig


A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a Series.

Note that we can also use the **to_frame** method to turn a Series into a DataFrame.
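
The two routes to a one-column DataFrame can be compared directly. This is a small sketch on a hand-built stand-in DataFrame (the name `elections` is an assumption), not the full dataset:

```python
import pandas as pd

elections = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic-Republican", "Democratic-Republican"],
})

as_series = elections["Candidate"]      # string label: Series
via_list = elections[["Candidate"]]     # one-element list: DataFrame
via_to_frame = as_series.to_frame()     # Series converted to a DataFrame

print(type(as_series).__name__)
print(type(via_list).__name__)
print(via_list.equals(via_to_frame))    # both routes give the same frame
```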

Extract the single column "Candidate" from the DataFrame; the result is a Series. Then convert that Series into a DataFrame.

In [31]:
# Code here
candidate = data['Candidate']
print(candidate.head(10))
candidate.to_frame().head(10)

0            Andrew Jackson
1         John Quincy Adams
2            Andrew Jackson
3         John Quincy Adams
4            Andrew Jackson
5                Henry Clay
6              William Wirt
7         Hugh Lawson White
8          Martin Van Buren
9    William Henry Harrison
Name: Candidate, dtype: object


,Candidate
0,Andrew Jackson
1,John Quincy Adams
2,Andrew Jackson
3,John Quincy Adams
4,Andrew Jackson
5,Henry Clay
6,William Wirt
7,Hugh Lawson White
8,Martin Van Buren
9,William Henry Harrison


The following cells allow you to **test your understanding**. Let's go over the summary of what we have learnt (see slides).

# Creating DataFrames
Create a DataFrame using the list and column names given in the slides.

In [33]:
# Code here
pd.DataFrame([[1,'one'],[2,'two'],[3,'three']], columns = ['Number', 'Description'])

,Number,Description
0,1,one
1,2,two
2,3,three


Create a DataFrame using the **dictionary** given in the slides.

In [35]:
# Code here
pd.DataFrame({"Fruits":['Strawberry', 'Orange'], "Price":[5.67, 9.65]})

,Fruits,Price
0,Strawberry,5.67
1,Orange,9.65


Create a DataFrame using the **Series** given in the slides.

In [36]:
# Code here
s_a = pd.Series(["a1", "a2", "a3"])
s_b = pd.Series(["b1", "b2", "b3"])
pd.DataFrame({"Column-A1":s_a, "Column-B1":s_b})

,Column-A1,Column-B1
0,a1,b1
1,a2,b2
2,a3,b3
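
One detail worth knowing when building a DataFrame from Series: the rows are matched by index label, not by position. The sketch below reuses the values from the cell above but gives the two Series deliberately different indices (an assumption made here purely for illustration).

```python
import pandas as pd

# Same values as above, but with indices that only partly overlap.
s_a = pd.Series(["a1", "a2", "a3"], index=[0, 1, 2])
s_b = pd.Series(["b1", "b2", "b3"], index=[1, 2, 3])

aligned = pd.DataFrame({"Column-A1": s_a, "Column-B1": s_b})

# The result covers labels 0 through 3; positions with no matching
# label in one of the Series are filled with missing values.
print(aligned)
```

When both Series use the default index, as in the exercise, this alignment is invisible because the labels already line up.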
