# Pandas Exercises

## Creating DataFrames and Using Sample Data Sets

This is the runnable Jupyter Notebook version of the exercises from the article [Pandas Practice Questions – Fifty-Two Examples to Make You an Expert](https://codesolid.com/pandas-practice-questions-twenty-one-examples-to-make-you-an-expert/).

In [2]:
import pandas as pd
import numpy as np
import seaborn as sb

**1.** Using NumPy, create a Pandas DataFrame with five rows and three columns:

In [3]:
df = pd.DataFrame(np.random.randn(5, 3))
df

Unnamed: 0,0,1,2
0,2.376482,1.134457,0.016314
1,-0.51965,-0.310522,-0.553385
2,-0.266015,0.715352,-0.419171
3,0.157374,-0.788791,-0.039851
4,0.247674,1.263553,2.339221


**2.** For a Pandas DataFrame created from a NumPy array, what is the default behavior for the labels for the columns?  For the rows?

Rows (index): labeled with integers starting from 0.

Columns: labeled with integers starting from 0.
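As a quick check of the answer above, this small sketch builds a DataFrame from a NumPy array without passing any labels and inspects both axes. The name `df_default` is only illustrative.

```python
import numpy as np
import pandas as pd

# No index or columns argument, so Pandas supplies the defaults
df_default = pd.DataFrame(np.zeros((5, 3)))

print(df_default.index)    # RangeIndex(start=0, stop=5, step=1)
print(df_default.columns)  # RangeIndex(start=0, stop=3, step=1)
```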

**3.** Create a second DataFrame as above with five rows and three columns, setting the row labels to the names of any five major US cities and the column labels to the first three months of the year.

In [4]:
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
months = ['January', 'February', 'March']

df2 = pd.DataFrame(np.random.randn(5, 3), index=cities, columns=months)
df2

Unnamed: 0,January,February,March
New York,-0.496714,-0.207348,-1.390749
Los Angeles,-1.532369,-0.284603,-2.211472
Chicago,-0.911094,-0.026145,0.570479
Houston,0.527791,1.429712,2.500285
Phoenix,-0.018728,0.861378,0.023091


**4.** You recall that the Seaborn package has some data sets built in, but can't remember how to list and load them. Assuming the functions to do so have "data" in the name, how might you locate them?  You can assume a Jupyter Notebook / IPython environment and explain the process, or write the code to do it in Python.

To find the functions containing 'data' in Seaborn, filter the names returned by dir():

In [5]:
[x for x in dir(sb) if 'data' in x]

['get_data_home', 'get_dataset_names', 'load_dataset']

## Loading data from CSV

**5.** Zillow home data is available at this URL: https://files.zillowstatic.com/research/public_csvs/zhvi/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv

Open this file as a DataFrame named df_homes in Pandas.

In [8]:
url = "https://files.zillowstatic.com/research/public_csvs/zhvi/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv"
df_homes = pd.read_csv(url)
df_homes.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,2000-01-31,2000-02-29,2000-03-31,2000-04-30,2000-05-31,...,2024-12-31,2025-01-31,2025-02-28,2025-03-31,2025-04-30,2025-05-31,2025-06-30,2025-07-31,2025-08-31,2025-09-30
0,102001,0,United States,country,,123328.165417,123545.139196,123814.21863,124391.341197,125055.54008,...,365372.217349,366007.694288,366471.281294,366193.660206,365661.222907,364970.033287,364347.749809,363917.639079,363688.149053,363931.687425
1,394913,1,"New York, NY",msa,NY,222096.674116,223040.459669,223992.986378,225923.174072,227921.950272,...,696533.300128,697633.014562,698948.623497,700676.33695,703154.623667,704955.084297,706457.1221,707650.653457,708454.797763,709880.464236
2,753899,2,"Los Angeles, CA",msa,CA,222620.175787,223448.605299,224552.065508,226747.580645,229148.785908,...,968173.703124,968714.163371,966675.169598,961554.657443,957217.264182,952421.630548,948047.015396,945442.656758,944366.276929,945428.022538
3,394463,3,"Chicago, IL",msa,IL,155857.310945,156001.589412,156276.370294,156959.956841,157782.229234,...,332649.038316,334015.386374,335383.265039,336242.517133,336852.936258,337148.112023,337514.741114,338367.259828,339378.680707,340732.680569
4,394514,4,"Dallas, TX",msa,TX,128023.642757,128080.664935,128146.217766,128316.451582,128540.900207,...,377150.293426,376619.688458,375829.848426,374322.685671,372189.0979,369765.587859,367394.345281,365371.53283,364005.278732,363356.229199


**6.** Save the DataFrame, df_homes, to a local CSV file, "zillow_home_data.csv".  

In [9]:
df_homes.to_csv("zillow_home_data.csv", index=False)

**7.** Load zillow_home_data.csv back into a new DataFrame, df_homes_2.

In [10]:
df_homes_2 = pd.read_csv("zillow_home_data.csv")

**8.** Compare the dimensions of the two DataFrames, df_homes and df_homes_2.  Are they equal?  If not, how can you fix it?

In [11]:
print(df_homes.shape)
print(df_homes_2.shape)

(895, 314)
(895, 314)


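The dimensions match because index=False was passed to to_csv in question 6. Without it, the row index is written to the file as an extra unnamed column, and the reloaded DataFrame has one more column than the original. In that case, either save again with index=False or reload with index_col=0.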
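To see the effect of `index=False` without touching the Zillow file, here is a minimal sketch that round-trips a tiny hypothetical DataFrame (`df_small`) through an in-memory buffer. The buffer stands in for the CSV file on disk; the names are illustrative only.

```python
import io
import pandas as pd

df_small = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical data

# Save with the index (the to_csv default), then reload
buf_with_index = io.StringIO()
df_small.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# Save without the index, then reload
buf_no_index = io.StringIO()
df_small.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(reloaded_with_index.shape)  # one extra column, named "Unnamed: 0"
print(reloaded_no_index.shape)    # same shape as df_small

# Alternative fix: read the saved index column back in as the index
buf_with_index.seek(0)
reloaded_fixed = pd.read_csv(buf_with_index, index_col=0)
```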

**9.** A remote spreadsheet showing a snapshot of how traffic increased for a hypothetical website is available here: https://github.com/CodeSolid/CodeSolid.github.io/raw/main/booksource/data/AnalyticsSnapshot.xlsx. Load the worksheet of the spreadsheet labelled "February 2022" as a DataFrame named "feb".  Note: the leftmost column in the spreadsheet is the index column.

In [12]:
url_excel = "https://github.com/CodeSolid/CodeSolid.github.io/raw/main/booksource/data/AnalyticsSnapshot.xlsx"
feb = pd.read_excel(url_excel, sheet_name="February 2022", index_col=0)
feb.head()

Unnamed: 0,This Month,Last Month,Month to Month Increase
Users,1800,280,5.428571
New Users,1700,298,4.704698
Page Views,2534,436,4.811927


**10.** The "Month to Month Increase" column is a bit hard to understand, so ignore it for now.  Given the values for "This Month" and "Last Month", create a new column, "Percentage Increase".

In [13]:
feb["Percentage Increase"] = ((feb["This Month"] - feb["Last Month"]) / feb["Last Month"]) * 100
feb.head()

Unnamed: 0,This Month,Last Month,Month to Month Increase,Percentage Increase
Users,1800,280,5.428571,542.857143
New Users,1700,298,4.704698,470.469799
Page Views,2534,436,4.811927,481.192661


## Basic Operations on Data

**11.** Using Seaborn, get a dataset about penguins into a dataframe named "df_penguins".  Note that because all of the following questions depend on this example, we'll provide the solution here so no one gets stuck:

In [14]:
df_penguins = sb.load_dataset('penguins')

**12.** Write the code to show the number of rows and columns in df_penguins.

In [15]:
df_penguins.shape

(344, 7)

**13.** How might you show the first few rows of df_penguins?

In [16]:
df_penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


**14.** How can you return the unique species of penguins from df_penguins?  How many unique species are there?

In [17]:
unique_species = df_penguins["species"].unique()
num_species = df_penguins["species"].nunique()

print("Unique species:", unique_species)
print("Number of unique species:", num_species)

Unique species: ['Adelie' 'Chinstrap' 'Gentoo']
Number of unique species: 3


**15.** What function can we use to drop the rows that have missing data?

In [18]:
df_penguins_clean = df_penguins.dropna()
df_penguins_clean.shape

(333, 7)

**16.** By default, will this modify df_penguins or will it return a copy?

**17.** How can we override the default?
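Questions 16 and 17 can be explored with a small sketch. It uses a hypothetical three-row DataFrame (`df_demo`) instead of the penguins data so that it runs without a download. By default `dropna()` returns a new DataFrame and leaves the original alone; passing `inplace=True` changes the original and returns `None`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing value
df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

# Default: a new DataFrame comes back; df_demo itself is unchanged
cleaned = df_demo.dropna()
rows_before = len(df_demo)   # still 3
rows_cleaned = len(cleaned)  # 2

# Override: modify df_demo in place; the call itself returns None
result = df_demo.dropna(inplace=True)
rows_after = len(df_demo)    # 2
```

Assigning the return value, as in question 15, is generally the less surprising style, since `inplace=True` silently discards the original rows.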

**18.** Create a new DataFrame, df_penguins_full, with the missing data deleted.

**19.** What is the average bill length of a penguin, in millimeters, in this (df_penguins_full) data set?

**20.** Which of the following is most strongly correlated with bill length?  a) Body mass?  b) Flipper length?  c) Bill depth?  Show how you arrived at the answer.
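One possible approach to question 20 is to compute the correlation matrix and read off the bill length column. The sketch below uses made-up measurements in a hypothetical `df_demo`, so the numbers it prints say nothing about the real penguins. On the real `df_penguins_full`, which also has text columns, recent Pandas versions need `corr(numeric_only=True)`.

```python
import pandas as pd

# Hypothetical measurements, not the real penguins data
df_demo = pd.DataFrame({
    "bill_length_mm": [39.1, 40.3, 46.5, 50.0, 49.9],
    "bill_depth_mm": [18.7, 18.0, 17.9, 15.2, 16.1],
    "flipper_length_mm": [181.0, 195.0, 192.0, 218.0, 213.0],
    "body_mass_g": [3750.0, 3250.0, 3500.0, 5700.0, 5400.0],
})

# Pairwise Pearson correlations; keep the bill_length_mm column
# and drop its trivial correlation with itself
corr_with_bill = df_demo.corr()["bill_length_mm"].drop("bill_length_mm")

# Rank by absolute value so strong negative correlations also count
strongest = corr_with_bill.abs().idxmax()

print(corr_with_bill)
print("Most strongly correlated:", strongest)
```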

**21.** How could you show the median flipper length, grouped by species?
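A grouped median can be sketched with `groupby`. The data here is a hypothetical two-species frame (`df_demo`, with placeholder species "A" and "B"), chosen so the medians are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data with placeholder species names
df_demo = pd.DataFrame({
    "species": ["A", "A", "B", "B", "B"],
    "flipper_length_mm": [180.0, 190.0, 210.0, 220.0, 230.0],
})

# Median flipper length per species
median_flipper = df_demo.groupby("species")["flipper_length_mm"].median()
print(median_flipper)

# The species whose median is largest
longest = median_flipper.idxmax()
print(longest)
```

The same pattern with `.mean()` or `.max()` in place of `.median()` applies to the grouped questions that follow.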

**22.** Which species has the longest flippers?

**23.** Which two species have the most similar mean weight?  Show how you arrived at the answer.

**24.** How could you sort the rows by bill length?

**25.** How could you run the same sort in descending order?

**26.** How could you sort by species first, then by body mass?
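The three sorting questions above all come down to `sort_values`. This sketch uses a small hypothetical frame (`df_demo`) with the same column names as the penguins data.

```python
import pandas as pd

# Hypothetical rows using the penguins column names
df_demo = pd.DataFrame({
    "species": ["Gentoo", "Adelie", "Gentoo", "Adelie"],
    "bill_length_mm": [46.1, 39.1, 50.0, 36.7],
    "body_mass_g": [4500.0, 3750.0, 5700.0, 3450.0],
})

# Ascending sort by one column (the default)
by_bill = df_demo.sort_values("bill_length_mm")

# Same sort, descending
by_bill_desc = df_demo.sort_values("bill_length_mm", ascending=False)

# Sort by species first, then by body mass within each species
by_species_mass = df_demo.sort_values(["species", "body_mass_g"])
```

Like `dropna()`, `sort_values` returns a new DataFrame by default and leaves the original row order untouched.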

## Selecting Rows, Columns, and Cells

Let's look at some precious stones now, and leave the poor penguins alone for a while.

**27.** Load the Seaborn "diamonds" dataset into a Pandas dataframe named diamonds.

**28.** Display the columns that are available.

**29.** If you select a single column from the diamonds DataFrame, what will be the type of the return value?

**30.** Select the 'table' column and show its type.
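For questions 29 and 30, a short sketch on a hypothetical two-row frame (`diamonds_demo`, standing in for the real dataset) shows the difference between selecting with a single name and selecting with a list of names.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data
diamonds_demo = pd.DataFrame({
    "carat": [0.23, 0.21],
    "table": [55.0, 61.0],
    "price": [326, 326],
})

# A single column name returns a Series
table_col = diamonds_demo["table"]
print(type(table_col))

# A list of names, even a list of one, returns a DataFrame
table_frame = diamonds_demo[["table"]]
print(type(table_frame))
```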

**31.** Select the first ten rows of the price and carat columns of the diamonds DataFrame into a variable called subset, and display them.

**32.** For a given column, show the code to display the datatype of the _values_ in the column.
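Questions 31 and 32 might be approached as follows. The twelve-row `diamonds_demo` frame is hypothetical and only mimics the column names of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in with twelve rows
diamonds_demo = pd.DataFrame({
    "carat": [0.2 + 0.01 * i for i in range(12)],
    "cut": ["Ideal"] * 12,
    "price": [300 + 5 * i for i in range(12)],
})

# First ten rows of two columns: pick the columns, then take the head
subset = diamonds_demo[["price", "carat"]].head(10)
print(subset)

# Equivalent positional form
subset_iloc = diamonds_demo[["price", "carat"]].iloc[:10]

# dtype of the values held in a single column
price_dtype = subset["price"].dtype
print(price_dtype)
```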

**33.** Select the first row of the diamonds DataFrame into a variable called row.

**34.** What would you expect the data type of the row to be?  Display it.

A Pandas Series.

**35.** Can you discover the names of the columns using only the row returned in #33?  Why or why not?

Yes. A row selected from a DataFrame is a Series whose index holds the DataFrame's column names, so the row's index reveals them.
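The answer to question 35 can be checked with a short sketch on a hypothetical `diamonds_demo` frame: the first row comes back as a Series, and its index lists the column names.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data
diamonds_demo = pd.DataFrame({
    "carat": [0.23, 0.21],
    "cut": ["Ideal", "Premium"],
    "price": [326, 327],
})

# Select the first row by position
row = diamonds_demo.iloc[0]

print(type(row))           # a Series
print(row.index.tolist())  # the column names of diamonds_demo
```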

**36.** Select the row with the highest priced diamond.

**37.** Select the row with the lowest priced diamond.
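One way to select the rows with the extreme prices is to combine `idxmax` or `idxmin` with `loc`. The sketch assumes a hypothetical three-row `diamonds_demo` frame. When several rows tie, `idxmax` and `idxmin` return the label of the first one.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data
diamonds_demo = pd.DataFrame({
    "carat": [0.23, 1.5, 0.9],
    "price": [326, 9500, 3100],
})

# idxmax/idxmin give the index label of the extreme value;
# loc then returns that row as a Series
most_expensive = diamonds_demo.loc[diamonds_demo["price"].idxmax()]
cheapest = diamonds_demo.loc[diamonds_demo["price"].idxmin()]

print(most_expensive)
print(cheapest)
```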

## Some Exercises Using Time Series

**38.** Load the taxis dataset into a DataFrame, ```taxis```.

**39.** The 'pickup' column contains the date and time the customer was picked up, but it's a string.  Add a column to the DataFrame, 'pickup_time', containing the value in 'pickup' as a datetime.
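The conversion in question 39 can be sketched with `pd.to_datetime`. The two-row `taxis_demo` frame and its values are hypothetical stand-ins for the real taxis data.

```python
import pandas as pd

# Hypothetical stand-in: 'pickup' holds timestamps as plain strings
taxis_demo = pd.DataFrame({
    "pickup": ["2019-03-23 20:21:09", "2019-03-04 16:11:55"],
    "fare": [7.0, 5.0],
})

# Parse the strings into a new datetime column
taxis_demo["pickup_time"] = pd.to_datetime(taxis_demo["pickup"])

print(taxis_demo.dtypes)
```

Once the column is a datetime, the `.dt` accessor (for example `taxis_demo["pickup_time"].dt.hour`) exposes its components.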

**40.** We have a hypothesis that as the day goes on, the tips get higher.  We'll need to wrangle the data a bit before testing this, however.  First, now that we have a datetime column, pickup_time, use it to select a subset of the rows into a new DataFrame, taxis_one_day. This new DataFrame should have values between '2019-03-23 00:06:00' (inclusive) and '2019-03-24 00:00:00' (exclusive).
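A half-open time range like the one in question 40 can be expressed as a boolean mask with `>=` on the start and `<` on the end. The four rows of `taxis_demo` below are hypothetical and were picked to sit on and around both boundaries.

```python
import pandas as pd

# Hypothetical rows around the two boundaries
taxis_demo = pd.DataFrame({
    "pickup_time": pd.to_datetime([
        "2019-03-22 23:59:00",  # before the range
        "2019-03-23 00:06:00",  # exactly the start (included)
        "2019-03-23 12:30:00",  # inside the range
        "2019-03-24 00:00:00",  # exactly the end (excluded)
    ]),
    "tip": [1.0, 2.0, 3.0, 4.0],
})

start = pd.Timestamp("2019-03-23 00:06:00")
end = pd.Timestamp("2019-03-24 00:00:00")

# Inclusive start, exclusive end
mask = (taxis_demo["pickup_time"] >= start) & (taxis_demo["pickup_time"] < end)
taxis_one_day = taxis_demo[mask]

print(taxis_one_day)
```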

**41.** We now have a range from morning until midnight, but we want to take the mean of the numeric columns, grouped at one hour intervals.  Save the result as df_means, and display it.
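One possible way to get hourly means is to put the datetime column in the index and use `resample`. In this sketch `taxis_one_day` is a hypothetical three-row stand-in, and the interval is written as "60min" because the spelling of the hourly alias has changed between Pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the one-day subset
taxis_one_day = pd.DataFrame({
    "pickup_time": pd.to_datetime([
        "2019-03-23 08:05:00",
        "2019-03-23 08:40:00",
        "2019-03-23 09:15:00",
    ]),
    "distance": [1.0, 3.0, 5.0],
    "payment": ["cash", "card", "card"],
})

df_means = (
    taxis_one_day
    .set_index("pickup_time")        # resample needs a datetime index
    .select_dtypes(include="number") # keep only the numeric columns
    .resample("60min")               # one bin per hour
    .mean()
)

print(df_means)
```

Hours with no rides would appear as rows of NaN, which may matter for the plot and the correlations later on.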

**42.** Create a simple line plot of the value "distance".  

**43.** Overall, do riders travel further or less far as the day progresses?

**44.** Create a new column in df_means, ```tip_in_percent```.  The source columns for this should be "fare" and "tip".

**45.** Create a new column, time_interval, as a range of integer values beginning with zero.

**46.** Display the correlations between the following pairs of values:
1. tip_in_percent and distance.
1. tip_in_percent and passengers.
1. tip_in_percent and time_interval.
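The three correlations listed above can be sketched with `Series.corr`. Everything in this example is hypothetical: the four rows of `df_means` are invented, and `tip_in_percent` is assumed to mean the tip as a percentage of the fare, which is one plausible reading of the exercise.

```python
import pandas as pd

# Hypothetical hourly means, not real taxi data
df_means = pd.DataFrame({
    "distance": [2.0, 3.5, 2.5, 4.0],
    "passengers": [1.0, 1.5, 2.0, 1.2],
    "fare": [10.0, 14.0, 11.0, 16.0],
    "tip": [1.5, 2.8, 1.1, 4.0],
})

# Assumption: tip expressed as a percentage of the fare
df_means["tip_in_percent"] = df_means["tip"] / df_means["fare"] * 100

# Integer counter starting at zero, one value per row
df_means["time_interval"] = range(len(df_means))

# Pearson correlation of tip_in_percent with each of the three columns
correlations = {}
for col in ["distance", "passengers", "time_interval"]:
    correlations[col] = df_means["tip_in_percent"].corr(df_means[col])
    print(col, correlations[col])
```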

**47.** Admittedly, the size of the data set is fairly small given how we've subsetted it.  But based on the correlations above, which of the three pairs shows the strongest correlation?

**48.** Did our hypothesis that people tip more as the day goes on turn out to be warranted?