# **IS 362 – Final Project**
## **Exporting the Scraped Data as a CSV**

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests, time, datetime, re, random

## Web Scraping Nintendo Switch Sales Data

### Source: https://en.wikipedia.org/wiki/List_of_best-selling_Nintendo_Switch_video_games

In [2]:
# These objects help retrieve the page from the URL. Depending on your web browser and
# operating system, you may need to change the 'User-Agent' value below. To find yours,
# search Google for "What is my user agent?", then paste the result in place of the
# string below, as in 'User-Agent' : "[your user agent]"
wiki_url = "https://en.wikipedia.org/wiki/List_of_best-selling_Nintendo_Switch_video_games"
headers = {'User-Agent' : "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"}

# Check that the request for the HTML page succeeded. A status code of 200 means "OK"
response = requests.get(wiki_url, headers=headers)
response.status_code

200

In [3]:
# parse data from the HTML page into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
soup

# search for the table of Nintendo Switch sales
html_table = soup.find('table', {'class' : "wikitable"})
html_table

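# create the DataFrame with the scraped data; the HTML string is wrapped in StringIO
# because newer pandas versions deprecate passing literal HTML to read_html
from io import StringIO
switch_games_sales = pd.read_html(StringIO(str(html_table)))[0]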
switch_games_sales

Unnamed: 0,Title,Copies sold,As of,Release date[a],Genre(s),Developer(s),Publisher(s)
0,Mario Kart 8 Deluxe,64.27 million[4],"September 30, 2024","April 28, 2017",Kart racing,Nintendo EPD,Nintendo
1,Animal Crossing: New Horizons,46.45 million[4],"September 30, 2024","March 20, 2020",Social simulation,Nintendo EPD,Nintendo
2,Super Smash Bros. Ultimate,35.14 million[4],"September 30, 2024","December 7, 2018",Fighting,Bandai Namco StudiosSora Ltd.,Nintendo
3,The Legend of Zelda: Breath of the Wild,32.29 million[4],"September 30, 2024","March 3, 2017",Action-adventure,Nintendo EPD,Nintendo
4,Super Mario Odyssey,28.50 million[4],"September 30, 2024","October 27, 2017",Platformer,Nintendo EPD,Nintendo
...,...,...,...,...,...,...,...
91,Fitness Boxing,1 million[43],"September 8, 2020","December 20, 2018",Exergamerhythm,Imagineer,JP: ImagineerNA/PAL: Nintendo
92,Fitness Boxing 2: Rhythm and Exercise,1 million[44],"December 9, 2021","December 4, 2020",Exergamerhythm,Imagineer,JP: ImagineerNA/PAL: Nintendo
93,Resident Evil 6,1 million[33],"March 31, 2024","October 29, 2019",Survival horror,Capcom,Capcom
94,Story of Seasons: Pioneers of Olive Town,1 million[45],"November 18, 2021","February 25, 2021",Simulationrole-playing,Marvelous,Xseed Games


The cell below corrects the data formats in the DataFrame. The scraped columns are stored as strings, so functions such as sort_values compare them character by character rather than by numeric or chronological value, and the results may not be ordered the way you expect. To fix this, the functions below parse the strings and convert them: sales figures become numbers, and dates are rewritten as mm/dd/yyyy.
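To see why string columns sort unexpectedly, here is a minimal sketch in plain Python, whose `sorted` compares strings character by character. The sample values are made up for illustration and are not taken from the table.

```python
# Illustrative values only (not taken from the scraped table)
raw_values = ['9.5 million', '64.27 million', '10 million']

# Sorting the raw strings compares them character by character,
# so '10 million' comes first because '1' < '6' < '9'
print(sorted(raw_values))

# Sorting by the numeric part gives the order you would expect
by_number = sorted(raw_values, key=lambda text: float(text.split()[0]))
print(by_number)
```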

In [4]:
def copies_sold_to_numeric(value):
    # Strip the word "million" and any footnote markers such as [4], then scale to a plain number
    parsed_string = re.sub(r'million(\[[a-z]*[0-9]*\])*', '', value)
    return float(parsed_string) * 1000000

def convert_date(date):
    format_type = '%B %d, %Y'  # current format: Month day, Year
    parsed_date = datetime.datetime.strptime(date, format_type)
    return parsed_date.strftime("%m/%d/%Y")  # mm/dd/yyyy
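As a quick check of the parsing logic, the sketch below runs the same regular expression and date format on two values from the table. The names `copies_sold_to_int` and `to_us_date` are hypothetical stand-ins, not the functions used in this notebook. The integer variant is an assumption on my part: multiplying a float by 1,000,000 can leave a tiny rounding error, so rounding to a whole number of copies is one way to avoid it.

```python
import re
from datetime import datetime

def copies_sold_to_int(value):
    # Hypothetical variant: remove "million" and footnote markers like [4],
    # then round to a whole number of copies
    number = re.sub(r'\s*million(\[[a-z]*[0-9]*\])*', '', value)
    return round(float(number) * 1_000_000)

def to_us_date(date):
    # 'Month day, Year' -> 'mm/dd/yyyy'
    return datetime.strptime(date, '%B %d, %Y').strftime('%m/%d/%Y')

print(copies_sold_to_int('64.27 million[4]'))
print(to_us_date('April 28, 2017'))
```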

The cell below applies the functions above to convert the column formats.

In [5]:
switch_game_sales = switch_games_sales.assign(
    as_of = lambda df: df['As of'].map(convert_date),
    release_date = lambda df: df['Release date[a]'].map(convert_date),
    copies_sold = lambda df: df['Copies sold'].map(copies_sold_to_numeric)
).drop(columns = ['Release date[a]', 'As of', 'Copies sold'])

# Since the original columns with unformatted data were removed, I renamed the new
# formatted columns to match the header text on the original Wikipedia page and
# put the columns back in their original order
switch_game_sales.rename(columns = {'as_of' : 'As of', 'release_date' : "Release date", "copies_sold" : "Copies sold"}, inplace=True)
switch_game_sales = switch_game_sales[['Title', 'Copies sold', 'As of', 'Release date', 'Genre(s)', 'Developer(s)', 'Publisher(s)']]

switch_game_sales

Unnamed: 0,Title,Copies sold,As of,Release date,Genre(s),Developer(s),Publisher(s)
0,Mario Kart 8 Deluxe,64270000.0,09/30/2024,04/28/2017,Kart racing,Nintendo EPD,Nintendo
1,Animal Crossing: New Horizons,46450000.0,09/30/2024,03/20/2020,Social simulation,Nintendo EPD,Nintendo
2,Super Smash Bros. Ultimate,35140000.0,09/30/2024,12/07/2018,Fighting,Bandai Namco StudiosSora Ltd.,Nintendo
3,The Legend of Zelda: Breath of the Wild,32290000.0,09/30/2024,03/03/2017,Action-adventure,Nintendo EPD,Nintendo
4,Super Mario Odyssey,28500000.0,09/30/2024,10/27/2017,Platformer,Nintendo EPD,Nintendo
...,...,...,...,...,...,...,...
91,Fitness Boxing,1000000.0,09/08/2020,12/20/2018,Exergamerhythm,Imagineer,JP: ImagineerNA/PAL: Nintendo
92,Fitness Boxing 2: Rhythm and Exercise,1000000.0,12/09/2021,12/04/2020,Exergamerhythm,Imagineer,JP: ImagineerNA/PAL: Nintendo
93,Resident Evil 6,1000000.0,03/31/2024,10/29/2019,Survival horror,Capcom,Capcom
94,Story of Seasons: Pioneers of Olive Town,1000000.0,11/18/2021,02/25/2021,Simulationrole-playing,Marvelous,Xseed Games
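One caveat worth noting: dates written as mm/dd/yyyy are still strings, so sorting them orders by month first rather than chronologically. The sketch below uses four "As of" values from the table and plain Python to show the difference. In pandas, converting the column with `pd.to_datetime(..., format='%m/%d/%Y')` should give the same chronological ordering, though that step is not part of this notebook.

```python
from datetime import datetime

# Four "As of" values from the formatted table
as_of = ['09/30/2024', '12/09/2021', '09/08/2020', '03/31/2024']

# String sort: ordered by month digits, not by actual date
print(sorted(as_of))

# Parse each string into a datetime to sort chronologically
chronological = sorted(as_of, key=lambda text: datetime.strptime(text, '%m/%d/%Y'))
print(chronological)
```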


## Exporting the Web Scraped Data as a CSV

This helps preserve the web-scraped data as a tabular file that can be loaded into a relational database. However, I won't be importing it back as a DataFrame in this project; I will use the DataFrame built directly from the scraped source.

In [6]:
switch_game_sales.to_csv('Switch_GameSales.csv', index=False)
print('Done')

Done
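Although the CSV is not read back in this project, here is a minimal sketch of what a round trip could look like. It uses an in-memory buffer instead of `Switch_GameSales.csv` and a two-row sample copied from the table; the names `sample` and `restored` are just for illustration. Dates come back as plain strings, so they are converted explicitly.

```python
import io
import pandas as pd

# Two rows copied from the formatted table above
sample = pd.DataFrame({
    'Title': ['Mario Kart 8 Deluxe', 'Fitness Boxing'],
    'Copies sold': [64270000.0, 1000000.0],
    'As of': ['09/30/2024', '09/08/2020'],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back; date columns are strings until converted
restored = pd.read_csv(buffer)
restored['As of'] = pd.to_datetime(restored['As of'], format='%m/%d/%Y')
print(restored.dtypes)
```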
