# Lab 03: Fun with `pandas`!

Below are some exercises to get you working with `pandas` to manipulate data. As always, get as far as you can, and ask for help when you need it! Your teacher (me), your instructor, and your classmates are all here to help each other get better at coding. Getting the code to work is important, but do also take the time to make sure you understand what the commands are doing. This time (with the exception of the Stroop challenge), all I've given you is the code to download the data. Then you are on your own. For the Stroop challenge, I gave you the code for the first step; after that, it's up to you :-)

## Music sales challenge

Write a script that:

1. Combines the tables of best-selling physical singles and best-selling digital singles on the Wikipedia page "List_of_best-selling_singles"
2. Adds a column which marks whether each row is from the list of physical singles or digital singles
3. Outputs the artist and single name for the year you were born. If there is no entry for that year, take the closest year after you were born.
4. Outputs the artist and single name for the year you were 15 years old.

In [36]:
# Starter code...

# %pip install lxml
import pandas as pd

rawdata = pd.read_html("https://en.wikipedia.org/wiki/List_of_best-selling_singles") # this gives a list of data frames
# print(rawdata)


# Inspect the data - find which table corresponds to digital and physical singles
# rawdata is a list of DataFrames, so we can inspect each DataFrame
for i, df in enumerate(rawdata):
    print(f"Table {i}:")
    #print(df.head(), '\n')  # Print the first few rows of each table to identify them

physical_singles = rawdata[0]
digital_singles = rawdata[3]

# add column to identify type of single
physical_singles['Type'] = 'Physical'
digital_singles['Type'] = 'Digital'

singles_merged = pd.concat([digital_singles,physical_singles], ignore_index=False)
#print(singles_merged)

Table 0:
Table 1:
Table 2:
Table 3:
Table 4:
Table 5:
Table 6:
Table 7:
Table 8:
Table 9:
Table 10:
Table 11:
Table 12:
Table 13:
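
Printing only the table numbers does not say much about which table is which. A minimal sketch of a more informative overview follows, assuming a list of DataFrames like the one `pd.read_html` returns. The `toy_tables` list and the `summarize_tables` helper are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html
toy_tables = [
    pd.DataFrame({"Artist": ["A"], "Single": ["S"], "Released": [1990],
                  "Sales (in millions)": ["10"]}),
    pd.DataFrame({"Note": ["not a singles table"]}),
]

def summarize_tables(tables):
    """Return one row per table: its position, row count, and column names."""
    rows = [
        {"table": i, "rows": len(t), "columns": ", ".join(map(str, t.columns))}
        for i, t in enumerate(tables)
    ]
    return pd.DataFrame(rows)

summary = summarize_tables(toy_tables)
print(summary.to_string(index=False))
```

Scanning the `columns` entry is usually enough to spot the tables that hold singles before picking them out by index.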


In [38]:
# Clean the sales column so only numbers
print(singles_merged.columns) # check column names
singles_merged['Sales'] = singles_merged['Sales (in millions)'] # add new column called Sales
singles_merged['Sales'] = singles_merged['Sales'].replace(r'[^\d.]', '', regex=True).astype(float) # keep only digits and decimal points, then convert to float

# sort by descending
singles_merged = singles_merged.sort_values(by="Sales", ascending = False)

# re-index list
singles_merged.reset_index(drop=True, inplace=True)

print(singles_merged)

Index(['Artist', 'Single', 'Released', 'Sales (in millions)', 'Source', 'Type',
       'Sales'],
      dtype='object')
                                  Artist  \
0                            Bing Crosby   
1                             Ed Sheeran   
2      Luis Fonsi featuring Daddy Yankee   
3                             Elton John   
4                Rihanna featuring Drake   
5                            Bing Crosby   
6                             Tino Rossi   
7                Bill Haley & His Comets   
8                        Whitney Houston   
9          The Chainsmokers and Coldplay   
10                            Ed Sheeran   
11    Wiz Khalifa featuring Charlie Puth   
12     The Chainsmokers featuring Halsey   
13                                 Adele   
14                        USA for Africa   
15                         Elvis Presley   
16      Mark Ronson featuring Bruno Mars   
17                         Billie Eilish   
18                            Ed Sheeran   
1
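
The cleaning step above keeps every digit and dot in a cell, so a footnote marker such as `[2]` would be merged into the number. Here is a hedged sketch of a slightly more defensive version. The `raw_sales` values are invented examples, not values taken from the Wikipedia tables.

```python
import pandas as pd

# Hypothetical messy sales strings (not taken from the Wikipedia page)
raw_sales = pd.Series(["50", "33[2]", "approx. 28.5", "20 million"])

cleaned = (
    raw_sales
    .str.replace(r"\[[^\]]*\]", "", regex=True)      # drop bracketed footnote markers such as [2]
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)   # keep the first number found in each cell
    .astype(float)
)
print(cleaned.tolist())
```

Whether this is needed depends on what the scraped column actually contains, so it is worth printing a few raw values first.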

In [44]:
# output artist and single from year I was born - 2000. 
target_year = 2000

singles_2000 = singles_merged[singles_merged['Released'] == target_year]

if not singles_2000.empty:
    print(f"Singles from the year {target_year}:")
    print(singles_2000[['Artist', 'Single']])
else:
    # If there are no singles from the target year, search for the closest year with singles
    print(f"No singles found from the year {target_year}. Searching for the closest year...")
    # Initialize variables for the closest year search
    found_singles = False
    offset = 1  # Start checking 1 year before/after the target year
    
    while not found_singles:
        # Check the year before and the year after; the earlier year wins if both have singles
        year_before = target_year - offset
        year_after = target_year + offset
        
        # Check if there are singles from year_before or year_after
        singles_before = singles_merged[singles_merged['Released'] == year_before]
        singles_after = singles_merged[singles_merged['Released'] == year_after]
        
        if not singles_before.empty:
            print(f"Singles from the year {year_before}:")
            print(singles_before[['Artist', 'Single']])
            found_singles = True  # Exit the loop when found
        elif not singles_after.empty:
            print(f"Singles from the year {year_after}:")
            print(singles_after[['Artist', 'Single']])
            found_singles = True  # Exit the loop when found
        
        # Increment offset to check further years if no singles are found
        offset += 1

No singles found from the year 2000. Searching for the closest year...
Singles from the year 1997:
         Artist                                             Single
3    Elton John  "Something About the Way You Look Tonight"/"Ca...
22  Celine Dion                              "My Heart Will Go On"
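
The exercise asks for the closest year *after* the birth year, whereas the loop above also searches backwards (which is why it lands on 1997). A minimal sketch of a forward-only search follows, assuming the `Released` column holds numeric years. The `toy_singles` frame and the `singles_for_year_or_next` helper are hypothetical and only stand in for `singles_merged`.

```python
import pandas as pd

# Toy stand-in for singles_merged (hypothetical rows, not the Wikipedia data)
toy_singles = pd.DataFrame({
    "Artist": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "Single": ["Song A", "Song B", "Song C", "Song D"],
    "Released": [1997, 2003, 2003, 2015],
})

def singles_for_year_or_next(df, target_year, year_col="Released"):
    """Return rows for target_year, or for the nearest later year that has rows."""
    later_years = df.loc[df[year_col] >= target_year, year_col]
    if later_years.empty:
        return df.iloc[0:0]  # empty frame with the same columns
    return df[df[year_col] == later_years.min()]

result = singles_for_year_or_next(toy_singles, 2000)
print(result[["Artist", "Single", "Released"]])
```

Because it filters once instead of looping year by year, this version also cannot run forever when no later year exists; it simply returns an empty frame.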


In [49]:
# find singles and artists from the year when you were 15

age_15 = 2015 # year I was 15

singles_2015 = singles_merged[singles_merged['Released'] == age_15] # locate release year
print(singles_2015[['Artist', 'Single', 'Released']]) # see single and artist


                                Artist               Single  Released
11  Wiz Khalifa featuring Charlie Puth      "See You Again"      2015
16    Mark Ronson featuring Bruno Mars        "Uptown Funk"      2015
18                          Ed Sheeran  "Thinking Out Loud"      2015


## Space challenge

1. Make a single dataframe that combines the space missions from the 1950s to the 2020s
2. Write a script that returns the year with the most launches
3. Write a script that returns the most common month for launches
4. Write a script that ranks the months from most launches to fewest launches


In [67]:
# Starter code...

#%pip install lxml
import pandas as pd

spacedata = pd.read_html("https://en.wikipedia.org/wiki/Timeline_of_Solar_System_exploration")
# print(spacedata) # outputs list of various data frames

# check how many tables there are
print(len(spacedata))
# combine all tables into one data frame
space_all = pd.concat(spacedata, ignore_index=True)
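
Concatenating every table on the page may also pull in tables that are not mission lists. A hedged sketch of filtering first follows, assuming the mission tables can be recognised by a `Launch date` column. The `toy_tables` list is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html
toy_tables = [
    pd.DataFrame({"Mission name": ["M1"], "Launch date": ["4 October 1957"]}),
    pd.DataFrame({"Legend": ["colour key"]}),  # a table with no launch data
    pd.DataFrame({"Mission name": ["M2", "M3"],
                  "Launch date": ["12 April 1961", "16 July 1969"]}),
]

# Keep only the tables that actually have a 'Launch date' column
mission_tables = [t for t in toy_tables if "Launch date" in t.columns]
missions = pd.concat(mission_tables, ignore_index=True)
print(missions)
```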


In [73]:
# Script to return year with the most launches

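# Step 1: Convert the 'Launch date' column to datetime
# (errors='coerce' turns entries that do not match the format into NaT instead of raising an error)
space_all['Launch date'] = pd.to_datetime(space_all['Launch date'], format='%d %B %Y', errors='coerce')
# Step 2: Isolate the year from the datetime column and create new column in df
space_all['Year'] = space_all['Launch date'].dt.year

# Step 3: Count the missions launched in each year
missionsperyear = space_all.groupby('Year').size()
# Find the year with the most launches
year_with_most_missions = missionsperyear.idxmax()
most_missions_count = missionsperyear.max()
print(f"Year with the most launches: {int(year_with_most_missions)} ({most_missions_count} launches)")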
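
Steps 3 and 4 of the challenge (most common month, and months ranked by launches) can be approached the same way as the year count. A minimal sketch on invented data follows; the `toy_launches` frame is hypothetical, and the date format is assumed to match the one used above.

```python
import pandas as pd

# Hypothetical launch records (not the Wikipedia data)
toy_launches = pd.DataFrame({
    "Mission name": ["M1", "M2", "M3", "M4", "M5", "M6"],
    "Launch date": ["4 October 1957", "3 November 1957", "12 October 1960",
                    "1 March 1962", "unknown", "20 October 1975"],
})

# Unparseable dates become NaT and are dropped before counting
toy_launches["Launch date"] = pd.to_datetime(
    toy_launches["Launch date"], format="%d %B %Y", errors="coerce"
)
toy_launches = toy_launches.dropna(subset=["Launch date"])

# value_counts sorts from most to fewest launches
month_counts = toy_launches["Launch date"].dt.month_name().value_counts()
most_common_month = month_counts.idxmax()
month_ranking = month_counts.rename_axis("Month").reset_index(name="Launches")

print(most_common_month)
print(month_ranking)
```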
    

## Supervillain challenge

1. Write a script that combines the tables showing supervillain debuts from the 1930s through the 2010s
2. Write a script that ranks each decade in terms of how many supervillains debuted in that decade
3. Write a script that ranks the different comics companies in terms of how many supervillains they have, and display the results in a nice table (pandas dataframe)

In [152]:
rawdata = pd.read_html("https://en.wikipedia.org/wiki/List_of_comic_book_supervillain_debuts")
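
One possible shape for a solution is sketched below on invented data. It assumes that each decade has its own table and that the tables share a publisher column. The column names `Character` and `Publisher`, the decade labels, and all rows are hypothetical, so check the real column names in `rawdata` before adapting it.

```python
import pandas as pd

# Hypothetical per-decade tables (not the Wikipedia data)
toy_tables = {
    "1930s": pd.DataFrame({"Character": ["V1", "V2"],
                           "Publisher": ["Publisher A", "Publisher C"]}),
    "1940s": pd.DataFrame({"Character": ["V3", "V4", "V5"],
                           "Publisher": ["Publisher A", "Publisher B", "Publisher B"]}),
    "1960s": pd.DataFrame({"Character": ["V6"],
                           "Publisher": ["Publisher B"]}),
}

# Combine the tables, tagging each row with the decade it came from
villains = pd.concat(
    [table.assign(Decade=decade) for decade, table in toy_tables.items()],
    ignore_index=True,
)

# Rank decades by number of debuts (value_counts sorts from most to fewest)
decade_ranking = villains["Decade"].value_counts()

# Rank publishers and present the result as a DataFrame
publisher_ranking = (
    villains.groupby("Publisher").size()
    .sort_values(ascending=False)
    .rename_axis("Publisher")
    .reset_index(name="Villains")
)

print(decade_ranking)
print(publisher_ranking)
```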


## Stroop challenge

Every year between 2015 and 2021, the students in my Language, Cognition, and the Brain course participated in a version of the Stroop task. Using a stopwatch (ok, using their phones), they recorded how fast they could say a list of things (either reading or naming colors or color words). The column names mean "Reading with No Interference", "Naming with Interference", "Naming with No Interference", and "Reading with Interference". The times are in seconds.

### Stroop challenge 1
Transform these data from wide format to long format, so that the result is a dataframe with
- 1 column named "Participant_id" with a unique number for each participant (you can use the row indices)
- 1 column named "Year" with the year data
- 1 column named "Task" that shows which task they were doing
- 1 column named "RT" that shows their response time

In [153]:
# Starter code 1...

df = pd.read_csv("https://raw.githubusercontent.com/ethanweed/Stroop/master/Stroop-raw-over-the-years.csv")
df.head()

Unnamed: 0,Reading_NoInt,Naming_Int,Naming_NoInt,Reading_Int,Year
0,4.16,6.76,4.45,4.65,2015
1,4.35,7.73,4.78,4.46,2015
2,3.6,7.0,4.0,3.5,2015
3,3.9,9.03,4.6,6.3,2015
4,4.22,9.98,6.83,6.24,2015


In [154]:
# Starter code 2...

# Make a new column using the dataframe indices as participant numbers

df.index.name = 'Participant_id'
df = df.reset_index()

#df.reset_index(inplace = True)
# NOTE: This line does exactly the same thing as the line above:
# it replaces the original df with a new df with the updated index. That's what
# "inplace = True" means. Or, you can just assign the dataframe with the updated index
# to a new dataframe with the same name as the old dataframe, which is what I did above.
# The end result is the same.

df

Unnamed: 0,Participant_id,Reading_NoInt,Naming_Int,Naming_NoInt,Reading_Int,Year
0,0,4.16,6.76,4.45,4.65,2015
1,1,4.35,7.73,4.78,4.46,2015
2,2,3.60,7.00,4.00,3.50,2015
3,3,3.90,9.03,4.60,6.30,2015
4,4,4.22,9.98,6.83,6.24,2015
...,...,...,...,...,...,...
177,177,4.30,7.08,6.25,4.28,2021
178,178,4.75,9.66,6.12,5.49,2021
179,179,4.98,7.52,6.73,5.16,2021
180,180,5.16,8.81,8.19,5.51,2021
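
From here, one way to get to long format is `DataFrame.melt`. The sketch below uses only the first two rows shown above so that it runs on its own; the variable names `wide` and `long` are made up for illustration.

```python
import pandas as pd

# First two participants from the table above, in wide format
wide = pd.DataFrame({
    "Participant_id": [0, 1],
    "Reading_NoInt": [4.16, 4.35],
    "Naming_Int": [6.76, 7.73],
    "Naming_NoInt": [4.45, 4.78],
    "Reading_Int": [4.65, 4.46],
    "Year": [2015, 2015],
})

# Keep Participant_id and Year as identifiers; stack the four task columns
long = wide.melt(
    id_vars=["Participant_id", "Year"],
    var_name="Task",
    value_name="RT",
)
print(long)
```

Each participant now has one row per task, so two participants with four tasks each give eight rows.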


### Stroop challenge 2 (Advanced!!!)

Make a new dataframe which shows the mean response time (in seconds) for each task for each year.
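
A hedged sketch of one approach follows, assuming the data are already in long format with `Year`, `Task`, and `RT` columns. The response times below are invented so that the means are easy to check by hand.

```python
import pandas as pd

# Hypothetical long-format data (values invented for illustration)
long = pd.DataFrame({
    "Participant_id": [0, 0, 1, 1, 2, 2],
    "Year": [2015, 2015, 2015, 2015, 2016, 2016],
    "Task": ["Reading_NoInt", "Naming_Int"] * 3,
    "RT": [4.0, 7.0, 5.0, 8.0, 4.5, 9.0],
})

# Mean RT per year and task, with one row per year and one column per task
mean_rt = long.groupby(["Year", "Task"])["RT"].mean().unstack("Task")
print(mean_rt)
```

Leaving out `.unstack("Task")` keeps the result in long form, with one row per year and task combination, which may be more convenient for plotting.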