In [None]:
import pandas as pd # A general purpose Python library for data analysis
import numpy as np # A library for scientific computing in Python (e.g., provides high-performance multi-dimensional array objects and operations)

import matplotlib.pyplot as plt # a plotting library for Python and NumPy (readily customizable)
import seaborn as sns # Another plotting library for Python (more concise syntax, excellent default themes; built on top of matplotlib)
import time

## Knowledge Stream Summer 2023

In this notebook, we will learn about the key data structures provided by the Pandas library: **Data Frames, Series, and Indices**.

In addition, we will learn about the following operations:
* How to access data contained in these structures?
* How to read files (e.g., csv, xlsx, sql) to create these structures?
* How to carry out different data manipulation tasks using these structures?

`Dataset`: US elections with information about candidates, their party, votes won, year of election and the result.

## Reading in Data Frames from Files

Pandas has a number of useful file reading tools. You can see them enumerated by typing **"pd.re"** and pressing `tab`. We'll be using **read_csv** today. Note that these file reading functions do all the *data parsing* for you, which is very useful.
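As a small illustration of the parsing that `read_csv` performs, the sketch below reads CSV text from an in-memory buffer instead of a file. The three rows are taken from the elections data used in this notebook; the `io.StringIO` buffer and the names `csv_text` and `sample` are assumptions made only for this example.

```python
import io
import pandas as pd

# A few rows in CSV form (an illustrative stand-in for elections.csv)
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# Column types are inferred while parsing: integers, text, and floats
print(sample.dtypes)
print(sample.shape)
```

Nothing had to be converted by hand: the year and vote counts arrive as integers and the percentage as a float.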

Before loading a file into a dataframe, let's first take a look at the **elections.csv** file.

In [None]:
#Answer Here
df=pd.read_csv('elections.csv')
print(df)

     Year          Candidate                  Party  Popular vote Result  \
0    1824     Andrew Jackson  Democratic-Republican        151271   loss   
1    1824  John Quincy Adams  Democratic-Republican        113142    win   
2    1828     Andrew Jackson             Democratic        642806    win   
3    1828  John Quincy Adams    National Republican        500897   loss   
4    1832     Andrew Jackson             Democratic        702735    win   
..    ...                ...                    ...           ...    ...   
177  2016         Jill Stein                  Green       1457226   loss   
178  2020       Joseph Biden             Democratic      81268924    win   
179  2020       Donald Trump             Republican      74216154   loss   
180  2020       Jo Jorgensen            Libertarian       1865724   loss   
181  2020     Howard Hawkins                  Green        405035   loss   

             %  
0    57.210122  
1    42.789878  
2    56.203927  
3    43.796073  
4 

We can use the **head** method to show only the first few rows of a dataframe.


In [None]:
# Answer Here
df.head(10)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
5,1832,Henry Clay,National Republican,484205,loss,37.603628
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
7,1836,Hugh Lawson White,Whig,146109,loss,10.005985
8,1836,Martin Van Buren,Democratic,763291,win,52.272472
9,1836,William Henry Harrison,Whig,550816,loss,37.721543


There is also a **tail** method, which shows the last few rows.

In [None]:
#Answer Here
df.tail(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979
181,2020,Howard Hawkins,Green,405035,loss,0.255731


The `read_csv` function lets us specify a **column to use as the index** via the `index_col` argument. For example, we could have used **Year** as the index.

In [None]:
#Answer Here
df=pd.read_csv("elections.csv", index_col= "Year")
df

Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...
2016,Jill Stein,Green,1457226,loss,1.073699
2020,Joseph Biden,Democratic,81268924,win,51.311515
2020,Donald Trump,Republican,74216154,loss,46.858542
2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


Alternatively, we could have used the **set_index** method on the dataframe.

In [None]:
#Answer Here
df=pd.read_csv('elections.csv')
print(df)

     Year          Candidate                  Party  Popular vote Result  \
0    1824     Andrew Jackson  Democratic-Republican        151271   loss   
1    1824  John Quincy Adams  Democratic-Republican        113142    win   
2    1828     Andrew Jackson             Democratic        642806    win   
3    1828  John Quincy Adams    National Republican        500897   loss   
4    1832     Andrew Jackson             Democratic        702735    win   
..    ...                ...                    ...           ...    ...   
177  2016         Jill Stein                  Green       1457226   loss   
178  2020       Joseph Biden             Democratic      81268924    win   
179  2020       Donald Trump             Republican      74216154   loss   
180  2020       Jo Jorgensen            Libertarian       1865724   loss   
181  2020     Howard Hawkins                  Green        405035   loss   

             %  
0    57.210122  
1    42.789878  
2    56.203927  
3    43.796073  
4 

In [None]:
print(df.columns)

Index(['Year', 'Candidate', 'Party', 'Popular vote', 'Result', '%'], dtype='object')


In [None]:
df.set_index('Year', inplace=True)
print(df.head())

              Candidate                  Party  Popular vote Result          %
Year                                                                          
1824     Andrew Jackson  Democratic-Republican        151271   loss  57.210122
1824  John Quincy Adams  Democratic-Republican        113142    win  42.789878
1828     Andrew Jackson             Democratic        642806    win  56.203927
1828  John Quincy Adams    National Republican        500897   loss  43.796073
1832     Andrew Jackson             Democratic        702735    win  54.574789


# Caution:
By default, the **set_index** method (like most other DataFrame methods) **does not modify the dataframe** it is called on; it returns a new dataframe and leaves the original untouched. Note: There is a flag called `inplace` which does modify the calling dataframe (e.g., `df.set_index("Party", inplace=True)`). The cell above passes `inplace=True`, which is why `df` itself now has `Year` as its index.
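The difference between the default behaviour and `inplace=True` can be checked directly. The following is a minimal sketch on a tiny hand-built frame; `elections_demo` and the other variable names are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical stand-in for the elections data)
elections_demo = pd.DataFrame({
    'Year': [1824, 1824, 1828],
    'Candidate': ['Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson'],
    'Result': ['loss', 'win', 'win'],
})

# Default: a new DataFrame is returned and the caller is left unchanged
by_year = elections_demo.set_index('Year')
columns_before = list(elections_demo.columns)
print(columns_before)        # 'Year' is still an ordinary column here
print(by_year.index.name)

# inplace=True: the calling DataFrame is modified and None is returned
returned = elections_demo.set_index('Year', inplace=True)
columns_after = list(elections_demo.columns)
print(returned)
print(columns_after)         # 'Year' has moved into the index
```

Because the in-place call returns `None`, writing `df = df.set_index('Year', inplace=True)` would replace `df` with `None`; use either reassignment or `inplace=True`, not both.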

## Duplicate Columns?
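Row index labels do not have to be unique (the `Year` index above repeats values such as 1824). By contrast, column names MUST be unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates by appending a numeric suffix (e.g., `Result.1`). Load the file with duplicate columns (`duplicate.csv` in the cell below).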

In [None]:
#Answer Here
df2=pd.read_csv('duplicate.csv')
df2

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%,Unnamed: 6,Result.1
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122,,loss
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878,,win
2,1828,Andrew Jackson,Democratic,642806,win,56.203927,,win
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073,,loss
4,1832,Andrew Jackson,Democratic,702735,win,54.574789,,win
...,...,...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699,,loss
178,2020,Joseph Biden,Democratic,81268924,win,51.311515,,win
179,2020,Donald Trump,Republican,74216154,loss,46.858542,,loss
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979,,loss


In [None]:
print(df2.columns)

Index(['Year', 'Candidate', 'Party', 'Popular vote', 'Result', '%',
       'Unnamed: 6', 'Result.1'],
      dtype='object')
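The renaming of duplicate column names can be reproduced without the CSV file. The sketch below feeds `read_csv` a small in-memory table whose header repeats `Result`; the sample text and the name `dup` are assumptions for illustration.

```python
import io
import pandas as pd

# Header with a repeated column name (illustrative data only)
csv_text = """Year,Result,Result
1824,loss,loss
1824,win,win
"""

dup = pd.read_csv(io.StringIO(csv_text))

# The second 'Result' column is renamed so that every name stays unique
print(list(dup.columns))
```

The suffix `.1` marks the first repeat, which matches the `Result.1` column in the output above.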


## The [ ] Operator & Indexing

The DataFrame class has an indexing operator **[ ]** (also known as the 'bracket' operator) that lets you do a variety of different things. If you provide a string to the **[ ]** operator, you get back a ***Series*** corresponding to the requested label. Note that the cells below pass one-element lists such as `['Result']`, so they print DataFrames rather than Series.

1. Use **[ ]** to display different columns

2. Use a list to retrieve multiple columns
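To make the return types concrete, here is a short sketch on a hypothetical three-row frame called `demo` (not the notebook's `df`). It shows what the **[ ]** operator returns for a string, for a one-element list, and for a longer list.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
demo = pd.DataFrame({
    'Candidate': ['Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson'],
    'Party': ['Democratic-Republican', 'Democratic-Republican', 'Democratic'],
    'Result': ['loss', 'win', 'win'],
})

one_column = demo['Result']               # string label -> Series
as_frame = demo[['Result']]               # list with one label -> DataFrame
several = demo[['Candidate', 'Result']]   # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(several.shape)
```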

In [None]:
df.columns

Index(['Candidate', 'Party', 'Popular vote', 'Result', '%'], dtype='object')

In [None]:
#1.Use [ ] to display different columns
print(df[['Result']])

     Result
Year       
1824   loss
1824    win
1828    win
1828   loss
1832    win
...     ...
2016   loss
2020    win
2020   loss
2020   loss
2020   loss

[182 rows x 1 columns]


In [None]:
print(df[['Popular vote']])

      Popular vote
Year              
1824        151271
1824        113142
1828        642806
1828        500897
1832        702735
...            ...
2016       1457226
2020      81268924
2020      74216154
2020       1865724
2020        405035

[182 rows x 1 columns]


In [None]:
#2.Use a List to Retrieve Multiple Columns
print(df[['Candidate', 'Party']])

              Candidate                  Party
Year                                          
1824     Andrew Jackson  Democratic-Republican
1824  John Quincy Adams  Democratic-Republican
1828     Andrew Jackson             Democratic
1828  John Quincy Adams    National Republican
1832     Andrew Jackson             Democratic
...                 ...                    ...
2016         Jill Stein                  Green
2020       Joseph Biden             Democratic
2020       Donald Trump             Republican
2020       Jo Jorgensen            Libertarian
2020     Howard Hawkins                  Green

[182 rows x 2 columns]


The **[ ]** operator also accepts a list of strings. In this case, you get back a **DataFrame** corresponding to the requested strings.

In [None]:
# Answer Here
import pandas as pd

# Load the CSV file
df = pd.read_csv('duplicate.csv')


print("Retrieving 'Candidate' and 'Party' columns as a DataFrame:")
candidate_party_df = df[['Candidate', 'Party']]
print(candidate_party_df)

print("\nRetrieving 'Year', 'Candidate', and 'Popular vote' columns as a DataFrame:")
year_candidate_vote_df = df[['Year', 'Candidate', 'Popular vote']]
print(year_candidate_vote_df)


Retrieving 'Candidate' and 'Party' columns as a DataFrame:
             Candidate                  Party
0       Andrew Jackson  Democratic-Republican
1    John Quincy Adams  Democratic-Republican
2       Andrew Jackson             Democratic
3    John Quincy Adams    National Republican
4       Andrew Jackson             Democratic
..                 ...                    ...
177         Jill Stein                  Green
178       Joseph Biden             Democratic
179       Donald Trump             Republican
180       Jo Jorgensen            Libertarian
181     Howard Hawkins                  Green

[182 rows x 2 columns]

Retrieving 'Year', 'Candidate', and 'Popular vote' columns as a DataFrame:
     Year          Candidate  Popular vote
0    1824     Andrew Jackson        151271
1    1824  John Quincy Adams        113142
2    1828     Andrew Jackson        642806
3    1828  John Quincy Adams        500897
4    1832     Andrew Jackson        702735
..    ...                ...   

A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.

Note that we can also use the **to_frame** method to turn a Series into a DataFrame.

Extract the single column `Candidate` from the DataFrame; the result is a Series. Then convert that Series into a DataFrame.

In [None]:
# Answer Here
import pandas as pd

# Load the CSV file
df = pd.read_csv('elections.csv')


candidate_series = df['Candidate']
print("Series:")
print(candidate_series)


candidate_df = candidate_series.to_frame()
print("\nDataFrame:")
print(candidate_df)


candidate_df_alternative = df[['Candidate']]
print("\nDataFrame using a list of one label:")
print(candidate_df_alternative)


Series:
0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

DataFrame:
             Candidate
0       Andrew Jackson
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
4       Andrew Jackson
..                 ...
177         Jill Stein
178       Joseph Biden
179       Donald Trump
180       Jo Jorgensen
181     Howard Hawkins

[182 rows x 1 columns]

DataFrame using a list of one label:
             Candidate
0       Andrew Jackson
1    John Quincy Adams
2       Andrew Jackson
3    John Quincy Adams
4       Andrew Jackson
..                 ...
177         Jill Stein
178       Joseph Biden
179       Donald Trump
180       Jo Jorgensen
181     Howard Hawkins

[182 rows x 1 columns]


### Row Indexing

The `[]` operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

Extract a few rows from the DataFrame.

In [None]:
# Answer Here
# Extract the first 5 rows
print("First 5 rows:")
print(df[:5])

# Extract rows from index 10 to 15
print("\nRows from index 10 to 15:")
print(df[10:16])

# Extract the last 5 rows
print("\nLast 5 rows:")
print(df[-5:])


First 5 rows:
   Year          Candidate                  Party  Popular vote Result  \
0  1824     Andrew Jackson  Democratic-Republican        151271   loss   
1  1824  John Quincy Adams  Democratic-Republican        113142    win   
2  1828     Andrew Jackson             Democratic        642806    win   
3  1828  John Quincy Adams    National Republican        500897   loss   
4  1832     Andrew Jackson             Democratic        702735    win   

           %  
0  57.210122  
1  42.789878  
2  56.203927  
3  43.796073  
4  54.574789  

Rows from index 10 to 15:
    Year               Candidate       Party  Popular vote Result          %
10  1840        Martin Van Buren  Democratic       1128854   loss  46.948787
11  1840  William Henry Harrison        Whig       1275583    win  53.051213
12  1844              Henry Clay        Whig       1300004   loss  49.250523
13  1844              James Polk  Democratic       1339570    win  50.749477
14  1848              Lewis Cass  Democ

If you provide a single argument to the `[]` operator, it tries to use it as a name. This is true even if the argument passed to **[ ]** is an integer.
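The sketch below illustrates this rule on a hypothetical frame named `demo`: a lone integer is looked up as a column name and fails, while a slice selects rows and `iloc` selects by position.

```python
import pandas as pd

# Small illustrative frame with string column names (hypothetical data)
demo = pd.DataFrame({
    'Candidate': ['Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson'],
    'Result': ['loss', 'win', 'win'],
})

lookup_failed = False
try:
    demo[0]                  # interpreted as the column named 0
except KeyError as err:
    lookup_failed = True
    print("KeyError:", err)

first_two = demo[0:2]        # a slice is interpreted as row positions
first_row = demo.iloc[0]     # explicit positional access to one row

print(first_two)
print(first_row)
```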

In [None]:
# df[0]  # this does not work: a lone integer is treated as a column name and raises KeyError; uncomment to see it fail
print("First row using .iloc[]:")
print(df.iloc[0])

# Extract rows from index 0 to 4 using .iloc[]
print("\nRows from index 0 to 4 using .iloc[]:")
print(df.iloc[0:5])

# Assuming 'Year' is the index for label-based indexing
df.set_index('Year', inplace=True)

# Extract the row with index label 1824 using .loc[]
print("\nRow with index label 1824 using .loc[]:")
print(df.loc[1824])

First row using .iloc[]:
Year                             1824
Candidate              Andrew Jackson
Party           Democratic-Republican
Popular vote                   151271
Result                           loss
%                           57.210122
Name: 0, dtype: object

Rows from index 0 to 4 using .iloc[]:
   Year          Candidate                  Party  Popular vote Result  \
0  1824     Andrew Jackson  Democratic-Republican        151271   loss   
1  1824  John Quincy Adams  Democratic-Republican        113142    win   
2  1828     Andrew Jackson             Democratic        642806    win   
3  1828  John Quincy Adams    National Republican        500897   loss   
4  1832     Andrew Jackson             Democratic        702735    win   

           %  
0  57.210122  
1  42.789878  
2  56.203927  
3  43.796073  
4  54.574789  

Row with index label 1824 using .loc[]:
              Candidate                  Party  Popular vote Result          %
Year                          

The following cells allow you to **test your understanding**. Let's go over the summary of what we have learnt (see slides).

# Creating DataFrames
Create a DataFrame from a list of rows and a list of column names.

In [None]:
data = [
    [1, 'Alice', 24],
    [2, 'Bob', 27],
    [3, 'Charlie', 22]
]
columns = ['ID', 'Name', 'Age']

df = pd.DataFrame(data, columns=columns)
print(df)


   ID     Name  Age
0   1    Alice   24
1   2      Bob   27
2   3  Charlie   22


Create a DataFrame from a **dictionary** that maps column names to lists of values.

In [None]:
# Answer Here

data = {
    'ID': [1, 2, 3],
    'Name': ['A', 'B', 'C'],
    'Age': [24, 27, 22]
}

df = pd.DataFrame(data)
print(df)


   ID Name  Age
0   1    A   24
1   2    B   27
2   3    C   22


## Filtering via Boolean Array Selection

The `[]` operator also supports an array of Booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a **filtered version of the data frame**, where **only rows corresponding to True appear**.

In [None]:
# Reload the elections data; the mask needs exactly one Boolean per row
elections = pd.read_csv('elections.csv')
mask = [False] * len(elections)
for position in (7, 10, 14, 21):  # row positions to keep
    mask[position] = True
elections[mask]

One very common task in Data Science is **filtering**. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that **logical operators** like the equality operator can be applied to **Pandas Series data** to generate a **Boolean Array**.

Compare the 'Result' column to the string 'win' and show the results.

In [None]:
#Answer Here
import pandas as pd

# Sample data
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Result': ['win', 'lose', 'win', 'lose', 'win']
}


df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

bool_array = df['Result'] == 'win'

print("\nBoolean array:")
print(bool_array)
filtered_df = df[bool_array]


print("\nFiltered DataFrame:")
print(filtered_df)


Original DataFrame:
   ID Name Result
0   1    A    win
1   2    B   lose
2   3    C    win
3   4    D   lose
4   5    E    win

Boolean array:
0     True
1    False
2     True
3    False
4     True
Name: Result, dtype: bool

Filtered DataFrame:
   ID Name Result
0   1    A    win
2   3    C    win
4   5    E    win


Compare the 'Party' column to the string 'Democratic' and show the results.

In [None]:
#Answer Here
import pandas as pd

# Sample data
data = {
    'Candidate': ['A', 'B', 'C', 'D', 'E'],
    'Party': ['Democratic', 'Republican', 'Democratic', 'Independent', 'Democratic']
}

# Create DataFrame
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

bool_array = df['Party'] == 'Democratic'

print("\nBoolean array:")
print(bool_array)

filtered_df = df[bool_array]

print("\nFiltered DataFrame:")
print(filtered_df)


Original DataFrame:
  Candidate        Party
0         A   Democratic
1         B   Republican
2         C   Democratic
3         D  Independent
4         E   Democratic

Boolean array:
0     True
1    False
2     True
3    False
4     True
Name: Party, dtype: bool

Filtered DataFrame:
  Candidate       Party
0         A  Democratic
2         C  Democratic
4         E  Democratic


The output of the logical operator applied to the Series is **another Series with the same name and index, but of datatype boolean**.

These boolean Series can be used as an argument to the `[]` operator.
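These properties can be verified on a small example. The sketch below uses a hypothetical Series named `results` with a repeated `Year`-style index and checks the name, index, and dtype of the comparison result.

```python
import pandas as pd

# Illustrative Series with a non-default, repeated index (hypothetical data)
results = pd.Series(['loss', 'win', 'win'],
                    index=[1824, 1824, 1828],
                    name='Result')

# Elementwise comparison produces a Boolean Series
is_win = results == 'win'

print(is_win.name)                        # same name as the original Series
print(is_win.index.equals(results.index)) # same index as the original Series
print(is_win.dtype)

# The Boolean Series can be passed straight to the [] operator
winners = results[is_win]
print(winners)
```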

Create a DataFrame of all election winners since 1980.

In [None]:
import pandas as pd

# Sample election results data (hypothetical example)
data = {
    'Year': [1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016],
    'Winner': ['Candidate A', 'Candidate B', 'Candidate A', 'Candidate C', 'Candidate B',
               'Candidate C', 'Candidate A', 'Candidate B', 'Candidate A', 'Candidate B']
}

# Create DataFrame
election_results = pd.DataFrame(data)

# Display the DataFrame
print("Election Results DataFrame:")
print(election_results)

# Filter for winners since 1980
winners_since_1980 = election_results[election_results['Year'] >= 1980]

# Display the winners since 1980
print("\nWinners since 1980:")
print(winners_since_1980)


Election Results DataFrame:
   Year       Winner
0  1980  Candidate A
1  1984  Candidate B
2  1988  Candidate A
3  1992  Candidate C
4  1996  Candidate B
5  2000  Candidate C
6  2004  Candidate A
7  2008  Candidate B
8  2012  Candidate A
9  2016  Candidate B

Winners since 1980:
   Year       Winner
0  1980  Candidate A
1  1984  Candidate B
2  1988  Candidate A
3  1992  Candidate C
4  1996  Candidate B
5  2000  Candidate C
6  2004  Candidate A
7  2008  Candidate B
8  2012  Candidate A
9  2016  Candidate B


In the earlier cells, we assigned the result of the logical operator to a new variable called `bool_array`. This is uncommon. Usually, the Series is created and used on the same line, as in the cell above. Such code is a little tricky to read at first, but you'll get used to it quickly.

Show all 'win' results between 1980 and 2000.

In [None]:
#Answer Here

data = {
    'Year': [1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016],
    'Result': ['win', 'lose', 'win', 'lose', 'win', 'lose', 'win', 'lose', 'win', 'lose']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)


filtered_df = df[(df['Year'] >= 1980) & (df['Year'] <= 2000) & (df['Result'] == 'win')]

print("\nFiltered DataFrame (win results between 1980 and 2000):")
print(filtered_df)


Original DataFrame:
   Year Result
0  1980    win
1  1984   lose
2  1988    win
3  1992   lose
4  1996    win
5  2000   lose
6  2004    win
7  2008   lose
8  2012    win
9  2016   lose

Filtered DataFrame (win results between 1980 and 2000):
   Year Result
0  1980    win
2  1988    win
4  1996    win


Show all losing results for the Independent party.

In [None]:
# Answer Here
import pandas as pd

data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Result': ['win', 'lose', 'lose', 'win', 'win']
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Filter for 'Loss' results of 'Independent' party
filtered_df = df[(df['Party'] == 'Independent') & (df['Result'] == 'lose')]

# Display the filtered DataFrame
print("\nFiltered DataFrame ('Loss' results of 'Independent' party):")
print(filtered_df)


Original DataFrame:
  Candidate        Party Result
0     Alice   Democratic    win
1       Bob   Republican   lose
2   Charlie  Independent   lose
3     David  Independent    win
4       Eva   Democratic    win

Filtered DataFrame ('Loss' results of 'Independent' party):
  Candidate        Party Result
2   Charlie  Independent   lose


We can select multiple criteria by creating multiple boolean Series and combining them using the `&` operator.
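Besides `&` (and), Boolean Series can also be combined with `|` (or) and negated with `~` (not). The sketch below uses a hypothetical four-row frame named `demo`; note the parentheses in the inline form, which are needed because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
demo = pd.DataFrame({
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent'],
    'Result': ['win', 'loss', 'loss', 'win'],
})

is_independent = demo['Party'] == 'Independent'
is_loss = demo['Result'] == 'loss'

both = demo[is_independent & is_loss]        # rows meeting both conditions
either = demo[is_independent | is_loss]      # rows meeting at least one
neither = demo[~(is_independent | is_loss)]  # rows meeting none

# Inline form: each comparison must be wrapped in parentheses
inline = demo[(demo['Party'] == 'Independent') & (demo['Result'] == 'loss')]

print(both)
print(either)
print(neither)
```

Python's `and`, `or`, and `not` keywords do not work elementwise on Series, which is why the symbolic operators are used here.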

Show the 'win' results with a percentage below 50%.

In [None]:
# Answer Here
data = {
    'Candidate': ['A', 'B', 'C', 'D', 'E'],
    'Result': ['win', 'lose', 'win', 'lose', 'win'],
    'Percentage': [45, 55, 30, 70, 40]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Create boolean Series for the conditions
win_condition = df['Result'] == 'win'
percentage_condition = df['Percentage'] < 50

# Combine conditions using & operator
filtered_df = df[win_condition & percentage_condition]

# Display the filtered DataFrame
print("\nFiltered DataFrame (win results with percentage less than 50%):")
print(filtered_df)


Original DataFrame:
  Candidate Result  Percentage
0         A    win          45
1         B   lose          55
2         C    win          30
3         D   lose          70
4         E    win          40

Filtered DataFrame (win results with percentage less than 50%):
  Candidate Result  Percentage
0         A    win          45
2         C    win          30
4         E    win          40


Show all 'win' results between 1980 and 2000 again, this time using the `between` method.

In [None]:
# Answer Here


# Sample data
data = {
    'Year': [1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008],
    'Result': ['win', 'lose', 'win', 'lose', 'win', 'lose', 'win', 'lose']
}

# Create the DataFrame once, then filter it
df = pd.DataFrame(data)
filtered_df = df[df['Year'].between(1980, 2000) & (df['Result'] == 'win')]

# Display the filtered DataFrame
print(filtered_df)


   Year Result
0  1980    win
2  1988    win
4  1996    win


## loc and iloc

Show the first 5 entries.

In [None]:
# Answer Here
import pandas as pd

# Sample data
data = {
    'Year': [1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008],
    'Result': ['win', 'lose', 'win', 'lose', 'win', 'lose', 'win', 'lose']
}

df = pd.DataFrame(data)

print("Using .loc:")
print(df.loc[:4])

print("\nUsing .iloc:")
print(df.iloc[:5])  # Select rows by position


You can provide `.loc` with row labels (here the slice `0:4`) and a list of column labels `['Candidate', 'Party', 'Year']` as input to return a dataframe.

In [None]:

data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996]
}

df = pd.DataFrame(data)

selected_df = df.loc[0:4, ['Candidate', 'Party', 'Year']]

print(selected_df)


  Candidate        Party  Year
0     Alice   Democratic  1980
1       Bob   Republican  1984
2   Charlie  Independent  1988
3     David  Independent  1992
4       Eva   Democratic  1996


`loc` also supports **slicing** (for all types, including numeric and string labels!). Note that the slicing for `loc` is **inclusive**, even for numeric slices.

Use Slicing on Rows and Columns

In [None]:
# Answer Here

data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996],
    'Result': ['win', 'lose', 'win', 'lose', 'win']
}

df = pd.DataFrame(data)

# Show a slice of rows and columns using .loc
sliced_df = df.loc[1:3, 'Candidate':'Year']

print(sliced_df)


  Candidate        Party  Year
1       Bob   Republican  1984
2   Charlie  Independent  1988
3     David  Independent  1992


If we provide only a **single label** for the column argument, we get back a **Series**.

In [None]:
# Answer Here


data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996],
    'Result': ['win', 'lose', 'win', 'lose', 'win']
}

# Create DataFrame
df = pd.DataFrame(data)

# Select a single column using .loc
candidate_series = df.loc[:, 'Candidate']  # Select all rows for the 'Candidate' column

# Display the resulting Series
print(candidate_series)


0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: Candidate, dtype: object


If we want a data frame instead and don't want to use to_frame, we can provide a **list** containing the column name.

In [None]:
# Answer Here

import pandas as pd

# Sample data
data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996],
    'Result': ['win', 'lose', 'win', 'lose', 'win']
}

df = pd.DataFrame(data)

# Select a single column as a DataFrame using .loc
candidate_df = df.loc[:, ['Candidate']]

# Display the resulting DataFrame
print(candidate_df)


  Candidate
0     Alice
1       Bob
2   Charlie
3     David
4       Eva


If we give only one row but many column labels, we'll get back a **Series** corresponding to a row of the table. This new Series has a neat index, where **each entry is the name of the column** that the data came from.

In [None]:
# Answer Here

data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996],
    'Result': ['win', 'lose', 'win', 'lose', 'win']
}

# Create DataFrame
df = pd.DataFrame(data)

# Select a single row with multiple column labels using .loc
single_row_series = df.loc[1, ['Candidate', 'Party', 'Year']]

print(single_row_series)


Candidate           Bob
Party        Republican
Year               1984
Name: 1, dtype: object


If we omit the column argument altogether, the **default behavior is to retrieve all columns**.

In [None]:
# Answer Here
data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996],
    'Result': ['win', 'lose', 'win', 'lose', 'win']
}
df = pd.DataFrame(data)
single_row = df.loc[2]
print(single_row)


Candidate        Charlie
Party        Independent
Year                1988
Result               win
Name: 2, dtype: object


Specify rows and columns as lists to retrieve specific entries.

In [None]:
# Answer Here
data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996],
    'Result': ['win', 'lose', 'win', 'lose', 'win']
}
df = pd.DataFrame(data)
rows = [0, 2, 4]
columns = ['Candidate', 'Result']
specific_entries = df.loc[rows, columns]
print(specific_entries)


  Candidate Result
0     Alice    win
2   Charlie    win
4       Eva    win


Boolean Series are also boolean arrays, so we can use the Boolean Array Selection from earlier using loc as well.

In [None]:
# Answer Here
data = {
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996],
    'Result': ['win', 'lose', 'win', 'lose', 'win']
}
df = pd.DataFrame(data)
boolean_series = df['Result'] == 'win'
filtered_df = df.loc[boolean_series]
print(filtered_df)

  Candidate        Party  Year Result
0     Alice   Democratic  1980    win
2   Charlie  Independent  1988    win
4       Eva   Democratic  1996    win


## String-labeled Rows

Let's do a quick example using data with string-labeled rows instead of integer labeled rows, just to make sure we're really understanding loc.

Use mottos.csv file

In [21]:
# Answer Here
df = pd.read_csv('mottos.csv', index_col=0)

print("DataFrame from mottos.csv:")
print(df)
motto_entry = df.loc['California']  # the label must exist in the State index
print("\nMotto entry for 'California':")
print(motto_entry)


DataFrame from mottos.csv:
                                                            Motto  \
State                                                               
Alabama                             Audemus jura nostra defendere   
Alaska                                        North to the future   
Arizona                                                Ditat Deus   
Arkansas                                           Regnat populus   
California                                        Eureka (Εὕρηκα)   
Colorado                                          Nil sine numine   
Connecticut                               Qui transtulit sustinet   
Delaware                                 Liberty and Independence   
Florida                                           In God We Trust   
Georgia                               Wisdom, Justice, Moderation   
Hawaii                          Ua mau ke ea o ka ʻāina i ka pono   
Idaho                                               Esto perpetua   
Illinoi

A slice of rows can also be specified using slice notation, even if the rows have string labels instead of integer labels.
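A minimal sketch of slicing by string label follows. It builds a four-row frame by hand instead of reading mottos.csv; the mottos come from the output above, while the `Language` column and the name `mottos_demo` are assumptions added for illustration.

```python
import pandas as pd

# Hand-built stand-in for the mottos data ('Language' is an assumed column)
mottos_demo = pd.DataFrame(
    {
        'Motto': ['Audemus jura nostra defendere', 'North to the future',
                  'Ditat Deus', 'Regnat populus'],
        'Language': ['Latin', 'English', 'Latin', 'Latin'],
    },
    index=pd.Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas'], name='State'),
)

# Label slices with loc include both endpoints, on rows and on columns
subset = mottos_demo.loc['Alaska':'Arizona', 'Motto':'Language']
print(subset)
```

Both `'Alaska'` and `'Arizona'` appear in the result, because a label slice has no natural "one past the end" value to exclude.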

### iloc

`loc`'s cousin `iloc` is very similar, but is used to access rows and columns by numerical position instead of label. For example, to access the top 3 rows and first 3 columns of a table, we can use `iloc[0:3, 0:3]`. `iloc` slicing is **exclusive**, just like standard Python slicing of numerical values.
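The difference in endpoint handling is easy to miss when the row labels happen to be integers. The sketch below, on a hypothetical five-row frame named `demo`, applies the same slice `0:3` through `loc` and through `iloc`.

```python
import pandas as pd

# Small illustrative frame with the default integer row labels 0..4
demo = pd.DataFrame({
    'Candidate': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Party': ['Democratic', 'Republican', 'Independent', 'Independent', 'Democratic'],
    'Year': [1980, 1984, 1988, 1992, 1996],
    'Result': ['win', 'lose', 'win', 'lose', 'win'],
})

by_label = demo.loc[0:3]        # labels 0, 1, 2, 3 (end included)
by_position = demo.iloc[0:3]    # positions 0, 1, 2 (end excluded)
corner = demo.iloc[0:3, 0:3]    # first 3 rows and first 3 columns

print(len(by_label), len(by_position))
print(corner)
```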

Use iloc to extract first 3 rows and columns from elections DataFrame

In [22]:
#Answer Here
import pandas as pd

elections = pd.read_csv('elections.csv')

# Positions 0-2 on both axes: the first 3 rows and the first 3 columns
print(elections.iloc[0:3, 0:3])

We will use both `loc` and `iloc` in the course. `loc` is generally preferred for a number of reasons, for example:

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g., what column #17 represents.
3. It is robust against permutations of the data, e.g. if the Social Security Administration switches the order of two columns in its files.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
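The trade-off can be sketched on a tiny table. The frame below is hand-built from the first four rows of the elections data shown earlier (an assumption standing in for elections.csv; `elections_small` and `shuffled` are illustrative names). It shows that label-based selection survives a column reorder, while position-based selection silently returns different columns.

```python
import pandas as pd

# Hand-built stand-in for the first rows of elections.csv (assumption).
elections_small = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams",
                  "Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic-Republican", "Democratic-Republican",
              "Democratic", "National Republican"],
    "Popular vote": [151271, 113142, 642806, 500897],
})

# Position-based: rows 0-2 and columns 0-2 (end position excluded).
by_position = elections_small.iloc[0:3, 0:3]

# Label-based: rows labelled 0 through 2 and columns Year through Party (ends included).
by_label = elections_small.loc[0:2, "Year":"Party"]

# Simulate the data provider reordering the columns.
shuffled = elections_small[["Party", "Year", "Candidate", "Popular vote"]]

# loc with explicit labels still returns the same three columns...
still_same = shuffled.loc[0:2, ["Year", "Candidate", "Party"]]
# ...while iloc now picks whatever sits in the first three positions.
now_different = shuffled.iloc[0:3, 0:3]

print(by_position)
print(list(now_different.columns))
```

`iloc` remains the convenient choice when the position itself is what you care about, such as "the first three rows" or "the last column".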

## Handy Properties and Utility Functions for Series and DataFrames

The `head` and `describe` methods and the `shape` and `size` attributes can be used to quickly get a good sense of the data we're working with. For example:

In [23]:
mottos = pd.read_csv("mottos.csv")

In [None]:
# Answer Here

Size of DataFrame

In [24]:
# Answer Here
mottos.size

250

A size of 250 (50 rows × 5 columns) means our data file is relatively small, with only 250 total entries.
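How `shape`, `size`, and `len` relate can be checked on a small example. This is a sketch on a hand-built frame (an assumption, not the real mottos.csv; `frame` is an illustrative name): `size` is the number of rows times the number of columns, and `len` counts rows only.

```python
import pandas as pd

# Hand-built frame with 3 rows and 2 columns (assumption for illustration).
frame = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Language": ["Latin", "English", "Latin"],
})

n_rows, n_cols = frame.shape      # shape is a (rows, columns) tuple
total_cells = frame.size          # size is rows * columns
row_count = len(frame)            # len counts rows only

# A single column is a Series, so its shape has one entry.
column_shape = frame["Language"].shape

print(n_rows, n_cols, total_cells, row_count, column_shape)
```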

Shape of DataFrame

In [25]:
# Answer Here
df = pd.read_csv('mottos.csv', index_col=0)

shape = df.shape

# Display the shape
print(f"Shape of DataFrame: {shape} (Rows, Columns)")


Shape of DataFrame: (50, 4) (Rows, Columns)


Use the `describe` method to summarize the DataFrame and pick out the meaningful information.

In [26]:
# Answer Here
df = pd.read_csv('mottos.csv')
description = df.describe()
print("Statistical Summary of the DataFrame:")
print(description)


Statistical Summary of the DataFrame:
          State                          Motto Translation Language  \
count        50                             50          49       50   
unique       50                             50          30        8   
top     Alabama  Audemus jura nostra defendere           —    Latin   
freq          1                              1          20       23   

       Date Adopted  
count            50  
unique           47  
top            1893  
freq              2  


Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?
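The summary above reports `count`, `unique`, `top`, and `freq` because every column in the mottos data was read as text. As a rough sketch of how `describe` treats numeric and text columns differently, the example below uses a hand-built frame with the first three rows of the elections data (an assumption; `elections_small` is an illustrative name).

```python
import pandas as pd

# Hand-built stand-in for three rows of elections.csv (assumption).
elections_small = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
    "Popular vote": [151271, 113142, 642806],
})

# On a mixed frame, describe() summarizes only the numeric columns by default
# (count, mean, std, min, quartiles, max).
numeric_summary = elections_small.describe()

# On text-only data, describe() reports count, unique, top and freq instead.
text_summary = elections_small[["Party"]].describe()

print(numeric_summary)
print(text_summary)
```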

We can get a direct reference to the index using .index.


In [27]:
# Answer Here
index = df.index
print("Index of the DataFrame:")
print(index)

Index of the DataFrame:
RangeIndex(start=0, stop=50, step=1)


In [28]:
mottos.head(2)

     State                          Motto                 Translation  Language  Date Adopted
0  Alabama  Audemus jura nostra defendere  We dare defend our rights!     Latin          1923
1   Alaska            North to the future                           —   English          1967


It turns out the columns also have an Index. We can access this index by using `.columns`.

In [29]:
# Answer Here
columns = mottos.columns
print("Columns of the DataFrame:")
print(columns)

Columns of the DataFrame:
Index(['State', 'Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')
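A short sketch of how the row index and the column index relate, using a hand-built two-row frame (an assumption; `frame` and `by_state` are illustrative names). Moving a column into the index with `set_index` removes it from `.columns` and turns it into the row labels that `loc` uses.

```python
import pandas as pd

# Hand-built frame with the default integer index (assumption for illustration).
frame = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Language": ["Latin", "English"],
})

default_index = frame.index            # RangeIndex(start=0, stop=2, step=1)
default_columns = list(frame.columns)  # ['State', 'Language']

# set_index returns a new frame whose row labels come from the State column.
by_state = frame.set_index("State")

print(default_index)
print(by_state.index)
print(by_state.columns)
```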


## Sorting and Value Counts

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `sort_values`.

In [32]:
# Answer Here
elections = pd.read_csv('elections.csv')
elections.head()

   Year          Candidate                  Party  Popular vote Result          %
0  1824     Andrew Jackson  Democratic-Republican        151271   loss  57.210122
1  1824  John Quincy Adams  Democratic-Republican        113142    win  42.789878
2  1828     Andrew Jackson             Democratic        642806    win  56.203927
3  1828  John Quincy Adams    National Republican        500897   loss  43.796073
4  1832     Andrew Jackson             Democratic        702735    win  54.574789


As mentioned before, `sort_values`, like most Data Frame methods, returns a new object and does **not** modify the original data structure, unless you set `inplace=True`.
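The copy behaviour can be checked directly. This is a minimal sketch on a hand-built frame holding three rows of the elections data (an assumption; `votes` and `sorted_votes` are illustrative names): the sorted result is a new object, and the original keeps its row order.

```python
import pandas as pd

# Hand-built stand-in for three rows of elections.csv (assumption).
votes = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Year": [1824, 1824, 1828],
    "%": [57.210122, 42.789878, 56.203927],
})

# sort_values returns a new, sorted frame (ascending by default).
sorted_votes = votes.sort_values("%")

original_order = list(votes.index)        # unchanged
sorted_order = list(sorted_votes.index)   # row labels travel with their rows

print(original_order, sorted_order)
```

To keep the sorted result, assign it to a name as above rather than relying on the original being changed.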

If we want to sort in reverse order, we can set `ascending=False`.

In [33]:
elections.sort_values('%', ascending=False)

     Year           Candidate         Party  Popular vote Result          %
114  1964      Lyndon Johnson    Democratic      43127041    win  61.344703
91   1936  Franklin Roosevelt    Democratic      27752648    win  60.978107
120  1972       Richard Nixon    Republican      47168710    win  60.907806
79   1920      Warren Harding    Republican      16144093    win  60.574501
133  1984       Ronald Reagan    Republican      54455472    win  59.023326
..    ...                 ...           ...           ...    ...        ...
165  2008    Cynthia McKinney         Green        161797   loss   0.123442
148  1996        John Hagelin   Natural Law        113670   loss   0.118219
160  2004    Michael Peroutka  Constitution        143630   loss   0.117542
141  1992            Bo Gritz      Populist        106152   loss   0.101918


We can also use `sort_values` on Series objects.

In [34]:
mottos['Language'].sort_values().head(50)

46    Chinook Jargon
49           English
29           English
28           English
27           English
26           English
48           English
37           English
38           English
40           English
17           English
34           English
42           English
14           English
41           English
12           English
1            English
13           English
8            English
7            English
9            English
43           English
22            French
4              Greek
10          Hawaiian
19           Italian
39             Latin
44             Latin
36             Latin
45             Latin
47             Latin
35             Latin
33             Latin
0              Latin
31             Latin
30             Latin
23             Latin
21             Latin
20             Latin
18             Latin
16             Latin
15             Latin
11             Latin
6              Latin
5              Latin
3              Latin
2              Latin
32           

For Series, the `value_counts` method is often quite handy.

In [35]:
mottos['Language'].value_counts()

Language
Latin             23
English           21
Greek              1
Hawaiian           1
Italian            1
French             1
Spanish            1
Chinook Jargon     1
Name: count, dtype: int64

Also commonly used is the `unique` method, which returns **all unique values** as a numpy array.

In [36]:
mottos['Language'].unique()

array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)
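As a small sketch of how these two methods differ, the example below uses a hand-built Series (an assumption, not the real Language column; `languages` is an illustrative name). `value_counts` returns a Series of counts sorted from most to least frequent, while `unique` returns the distinct values in order of first appearance.

```python
import pandas as pd

# Hand-built stand-in for the Language column (assumption for illustration).
languages = pd.Series(
    ["Latin", "English", "Latin", "Greek", "English", "Latin"],
    name="Language",
)

# Counts per distinct value, most frequent first.
counts = languages.value_counts()

# Distinct values as a NumPy array, in order of first appearance.
distinct = languages.unique()

# Number of distinct values.
n_distinct = languages.nunique()

print(counts)
print(distinct, n_distinct)
```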

In [37]:
def fiba(n):
    # Naive recursive Fibonacci: fiba(0) = 0, fiba(1) = 1, fiba(n) = fiba(n-1) + fiba(n-2)
    if n < 2:
        return n
    else:
        return fiba(n-1) + fiba(n-2)



fiba(5)

5

# Thank you!