# Luca Viarengo's Data Cook Book

Here is my cookbook for my data final in CS-215.

I will note here in the preface that I initially wrote my cookbook as part of my manifesto, showing code snippets and explaining the tools that I used there. Then I read that it needed to be in a Jupyter notebook with all of the appropriate runnable files, which threw a wrench into my plan. So this cookbook will have some long explanation pieces for each code snippet that I found important, based on what I had written earlier in the Data Manifesto.

I put a lot of time and effort into writing those pieces for the cookbook in the Data Manifesto, which is why I don't want to just wipe everything that I've already written. That being said, here is my Data Cook Book.

# Intro
In this Data Cook Book, I want to pivot from just talking about the tools and principles I’ve used in my data science career, which is what the Data Manifesto was all about. Instead, I want to show you actual examples of what I have done with these tools and principles. The following are just a few snapshots of code that I have used to manipulate or visualize data in an interesting way.

# Recipe #1: API Requests

To begin, we are going to start with accessing APIs. APIs are application programming interfaces, which essentially allow two programs to interact with one another. Even before coming to this class, I had tinkered with APIs, with some minor level of success. This semester, however, I wanted to interact with a Pokemon API on the internet, which would allow me to request and receive information about all the Pokemon that exist, right in my Jupyter Notebook. The code that follows gets me the general information from the Pokemon API.


In [12]:
#First we need to import the right libraries

# Lets us talk to other servers on the web
import requests

# APIs spit out data in JSON
import json

# DataFrames!
import pandas as pd
import numpy as np

In [13]:
# This API request gets me a list of the first 20 pokemon and the url to find their specific information

poke_url = "https://pokeapi.co/api/v2/pokemon"
poke_info = requests.get(poke_url).json()
print(poke_info)

{'count': 1281, 'next': 'https://pokeapi.co/api/v2/pokemon?offset=20&limit=20', 'previous': None, 'results': [{'name': 'bulbasaur', 'url': 'https://pokeapi.co/api/v2/pokemon/1/'}, {'name': 'ivysaur', 'url': 'https://pokeapi.co/api/v2/pokemon/2/'}, {'name': 'venusaur', 'url': 'https://pokeapi.co/api/v2/pokemon/3/'}, {'name': 'charmander', 'url': 'https://pokeapi.co/api/v2/pokemon/4/'}, {'name': 'charmeleon', 'url': 'https://pokeapi.co/api/v2/pokemon/5/'}, {'name': 'charizard', 'url': 'https://pokeapi.co/api/v2/pokemon/6/'}, {'name': 'squirtle', 'url': 'https://pokeapi.co/api/v2/pokemon/7/'}, {'name': 'wartortle', 'url': 'https://pokeapi.co/api/v2/pokemon/8/'}, {'name': 'blastoise', 'url': 'https://pokeapi.co/api/v2/pokemon/9/'}, {'name': 'caterpie', 'url': 'https://pokeapi.co/api/v2/pokemon/10/'}, {'name': 'metapod', 'url': 'https://pokeapi.co/api/v2/pokemon/11/'}, {'name': 'butterfree', 'url': 'https://pokeapi.co/api/v2/pokemon/12/'}, {'name': 'weedle', 'url': 'https://pokeapi.co/api/v

The poke_url line here is the URL found on the website. Using that URL, I use the Python package requests to "get" the information from the Pokemon API. The API sends back JSON, and calling .json() parses it into a Python dictionary, which is stored in poke_info. The output that you see afterwards is what is now stored in poke_info. It looks like a jumble of text and curly brackets, but the values can be easily accessed with square brackets. If I call poke_info["count"], I get the count of all the Pokemon, which in this case is 1281. This isn't the only thing that I did with this API, however. I wanted to see what info I could get for a specific Pokemon, so I found Charizard's information and printed out what it contained.
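
To make the square-bracket access concrete, here is a minimal sketch that works on a small hand-built dictionary shaped like the response printed above. The name sample_info is a stand-in that is not in my notebook, and only the first three results are copied in, so the sketch runs without touching the network.

```python
# A small stand-in for the parsed response, copied from the first entries printed above.
# sample_info is a hypothetical name used only for this illustration.
sample_info = {
    "count": 1281,
    "next": "https://pokeapi.co/api/v2/pokemon?offset=20&limit=20",
    "previous": None,
    "results": [
        {"name": "bulbasaur", "url": "https://pokeapi.co/api/v2/pokemon/1/"},
        {"name": "ivysaur", "url": "https://pokeapi.co/api/v2/pokemon/2/"},
        {"name": "venusaur", "url": "https://pokeapi.co/api/v2/pokemon/3/"},
    ],
}

# Top-level keys are reached with one pair of square brackets.
total = sample_info["count"]

# "results" is a list of dictionaries: index into the list first, then use the key.
first_name = sample_info["results"][0]["name"]

# A list comprehension pulls one field out of every entry.
names = [entry["name"] for entry in sample_info["results"]]

print(total)
print(first_name)
print(names)
```

The same pattern applies to the real poke_info: each extra level of nesting in the JSON is one more pair of square brackets.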


In [14]:
# poke_info["results"] lists a specific url for each pokemon, which ends in that pokemon's id number
# For example, for bulbasaur, the url is "https://pokeapi.co/api/v2/pokemon/1/"
# I want to work with Charizard, whose id number is 6

# This is an example of using the first request to make additional requests, as the first request got me
# the url for Charizard specifically
charizard_url = ""
for entry in poke_info["results"]:
    if entry["name"] == "charizard":
        charizard_url = entry["url"]
        break  # names are unique, so we can stop at the first match
charizard_info = requests.get(charizard_url).json()
print(charizard_info)

{'abilities': [{'ability': {'name': 'blaze', 'url': 'https://pokeapi.co/api/v2/ability/66/'}, 'is_hidden': False, 'slot': 1}, {'ability': {'name': 'solar-power', 'url': 'https://pokeapi.co/api/v2/ability/94/'}, 'is_hidden': True, 'slot': 3}], 'base_experience': 267, 'forms': [{'name': 'charizard', 'url': 'https://pokeapi.co/api/v2/pokemon-form/6/'}], 'game_indices': [{'game_index': 180, 'version': {'name': 'red', 'url': 'https://pokeapi.co/api/v2/version/1/'}}, {'game_index': 180, 'version': {'name': 'blue', 'url': 'https://pokeapi.co/api/v2/version/2/'}}, {'game_index': 180, 'version': {'name': 'yellow', 'url': 'https://pokeapi.co/api/v2/version/3/'}}, {'game_index': 6, 'version': {'name': 'gold', 'url': 'https://pokeapi.co/api/v2/version/4/'}}, {'game_index': 6, 'version': {'name': 'silver', 'url': 'https://pokeapi.co/api/v2/version/5/'}}, {'game_index': 6, 'version': {'name': 'crystal', 'url': 'https://pokeapi.co/api/v2/version/6/'}}, {'game_index': 6, 'version': {'name': 'ruby', 'u

In this code, I'm trying to find the entry in the results list that contains the API URL specifically for Charizard. I did this by looping through the results, and once an entry's name was equal to "charizard", I stored the associated url in the variable charizard_url. I then used the requests package once again to retrieve Charizard’s specific information from that url. The printed text is the data found at Charizard’s url on the Pokemon API, and I can now work with these values to answer any question about Charizard I want. Note that this only works because Charizard is among the first 20 Pokemon returned by the first request. These are some of the basics of accessing free-to-use APIs, and they helped me finish one of our big projects of the semester.
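
The search can also be wrapped in a small helper. This is only a sketch, not the code I ran: find_url and sample_results are hypothetical names, and the sample list copies just a few entries from the output above. Returning None when a name is missing is useful because the first request only lists 20 Pokemon, so a Pokemon further down the list would not be found without also requesting the "next" url.

```python
def find_url(results, name):
    """Return the url of the entry whose name matches, or None if it is absent.

    `results` is a list of {"name": ..., "url": ...} dictionaries, shaped like
    poke_info["results"]. This helper is hypothetical, for illustration only.
    """
    for entry in results:
        if entry["name"] == name:
            return entry["url"]
    return None


# A few entries copied from the printed output, so no network access is needed.
sample_results = [
    {"name": "charmander", "url": "https://pokeapi.co/api/v2/pokemon/4/"},
    {"name": "charmeleon", "url": "https://pokeapi.co/api/v2/pokemon/5/"},
    {"name": "charizard", "url": "https://pokeapi.co/api/v2/pokemon/6/"},
]

found_url = find_url(sample_results, "charizard")
missing_url = find_url(sample_results, "mewtwo")  # not in this sample list

print(found_url)
print(missing_url)
```

Checking for None before making the second request avoids calling requests.get with an empty url.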


# Recipe #2: Getting Timestamps

Sometimes, when working with certain datasets, the question you ask has a lot to do with chronology. However, accessing timestamps can be quite the challenge, because different datasets store date and time in different ways. At times, time, day, month, year, etc. are all their own columns; other times they are combined into one giant column. Sometimes the format is year/month/day, other times it's month/day/year. It might initially seem impossible to standardize time, but Python has a module for that. It is called datetime, and it creates datetime objects that can be formatted in whatever way you choose. An example of a formatted datetime is 2017-11-05 15:51:43. Here is an example of me using datetime objects to find exact timestamps.

In [15]:
#Need to import the datetime package
import datetime as dt

df = pd.read_json("post_comments.json")

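#Here I am figuring out how to get my timestamp
#df.iloc[0, 0] picks the first row and first column by position
#Note: fromtimestamp() converts using this computer's local time zone
time = df.iloc[0, 0]["string_map_data"]["Time"]["timestamp"]
print(dt.datetime.fromtimestamp(time).strftime('%Y-%m-%d %H:%M:%S'))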

2017-11-05 15:51:43


In [16]:
# This code gets us the comments and the timestamps
timestamps = []
comments = []
for i in range(len(df)):
    # Each cell in the first column holds a nested dictionary for one comment
    entry = df.iloc[i, 0]["string_map_data"]
    time = entry["Time"]["timestamp"]
    comment = entry["Comment"]["value"]
    timestamps.append(dt.datetime.fromtimestamp(time).strftime('%Y-%m-%d %H:%M:%S'))
    comments.append(comment)

#Here is a new data frame that just has the timestamps and the comments
updated_df = pd.read_json("post_comments.json")
updated_df["Comments"] = comments
updated_df["Timestamp"] = timestamps
updated_df = updated_df.drop("comments_media_comments", axis = 1)
updated_df

Unnamed: 0,Comments,Timestamp
0,@just_kushal Ty Kushal,2017-11-05 15:51:43
1,Lol,2017-10-06 20:26:57
2,He signed ur phone tho,2019-03-08 22:33:16
3,@swish_i,2019-03-08 18:01:33
4,@swish_i,2019-03-08 18:01:33
...,...,...
1197,Sam darnold fans for the week,2021-12-29 11:48:43
1198,#fakenews,2021-12-28 00:01:26
1199,@no_username_0123_ well obviously they both ar...,2021-12-26 20:12:27
1200,@no_username_0123_ not counting klay or wisema...,2021-12-26 19:47:51


In this piece of code, I am extracting both comments and timestamps from the JSON file that I got from my own personal Instagram data. In the for loop, I go through the data frame and, for each row, extract the comment and the time. Then dt.datetime.fromtimestamp() turns the raw timestamp number into a datetime object, and .strftime() turns that object into a formatted string. The format codes passed to .strftime() determine how the timestamp is written out. I add each formatted value to the timestamps list that I created earlier, then add that list to the new dataframe called updated_df. You can see the results in the Timestamp column of the printed dataframe. The datetime module allows data scientists to convert time and date values into one easy-to-read, customizable format, so that they can begin to manipulate timestamps as they please.
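
Here is a small standard-library sketch of both directions described above: turning a timestamp number into formatted text, and parsing differently formatted date strings into one standard form. The timestamp and the date strings are made-up examples, not values from my Instagram data. I pass tz=dt.timezone.utc so the result does not depend on the computer's time zone, whereas the notebook code above uses local time.

```python
import datetime as dt

# A made-up Unix timestamp (seconds since 1970-01-01 UTC), for illustration only.
example_ts = 1_500_000_000

# Build a timezone-aware datetime object in UTC.
moment = dt.datetime.fromtimestamp(example_ts, tz=dt.timezone.utc)

# strftime turns the datetime object into text; the format codes pick the layout.
iso_style = moment.strftime("%Y-%m-%d %H:%M:%S")
us_style = moment.strftime("%m/%d/%Y")
print(iso_style)  # 2017-07-14 02:40:00
print(us_style)   # 07/14/2017

# strptime goes the other way: it parses text into a datetime object.
# Two different layouts of the same (made-up) day become equal objects.
year_first = dt.datetime.strptime("2021/12/29", "%Y/%m/%d")
month_first = dt.datetime.strptime("12/29/2021", "%m/%d/%Y")
same_day = year_first == month_first
print(same_day)   # True
```

Keeping the datetime object around, rather than only the formatted string, makes it possible to sort or subtract dates later.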


# Recipe #3: Max Value function
It is quite common to look at the recurring values in a dataset to get a feel for what that dataset contains. It is tempting to ask which value occurs the most in a dataset and build a question off of that. Pandas has a function for this called value_counts(), but in one of my projects, I wanted to go above and beyond and create a function in Python that gets both the highest count in a column and the name associated with that count, and I called it max_name_and_count().

In [17]:
#First we need to read in the csv file that contains the data that I used the function on
cars_df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

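#Some rows were missing a vehicle location, so I drop them
cars_df = cars_df.dropna(subset = ["Vehicle Location"])
#reset_index() returns a new dataframe for display here; cars_df itself keeps its original index
cars_df.reset_index()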

Unnamed: 0,index,VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
0,0,5YJ3E1EB4L,Yakima,Yakima,WA,98908.0,2020,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,322,0,14.0,127175366,POINT (-120.56916 46.58514),PACIFICORP,5.307700e+10
1,1,5YJ3E1EA7K,San Diego,San Diego,CA,92101.0,2019,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,220,0,,266614659,POINT (-117.16171 32.71568),,6.073005e+09
2,2,7JRBR0FL9M,Lane,Eugene,OR,97404.0,2021,VOLVO,S60,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,22,0,,144502018,POINT (-123.12802 44.09573),,4.103900e+10
3,3,5YJXCBE21K,Yakima,Yakima,WA,98908.0,2019,TESLA,MODEL X,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,289,0,14.0,477039944,POINT (-120.56916 46.58514),PACIFICORP,5.307700e+10
4,4,5UXKT0C5XH,Snohomish,Bothell,WA,98021.0,2017,BMW,X5,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,14,0,1.0,106314946,POINT (-122.18384 47.8031),PUGET SOUND ENERGY INC,5.306105e+10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124683,124711,5YJ3E1EB6N,Snohomish,Monroe,WA,98272.0,2022,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,0,39.0,192999061,POINT (-121.98087 47.8526),PUGET SOUND ENERGY INC,5.306105e+10
124684,124712,KNDCM3LD2L,Pierce,Tacoma,WA,98406.0,2020,KIA,NIRO,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,26,0,27.0,113346250,POINT (-122.52054 47.26887),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,5.305306e+10
124685,124713,7SAYGDEE0P,Whatcom,Bellingham,WA,98226.0,2023,TESLA,MODEL Y,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,0,42.0,232751305,POINT (-122.49756 48.7999),PUGET SOUND ENERGY INC||PUD NO 1 OF WHATCOM CO...,5.307300e+10
124686,124714,1G1FW6S03J,Pierce,Tacoma,WA,98444.0,2018,CHEVROLET,BOLT EV,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,238,0,29.0,102589007,POINT (-122.46495 47.16778),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,5.305307e+10


In [18]:
#This function returns a list: [name with the max count, max count as a string]
#Prerequisite: The data must be sorted on the column that is the parameter column_name.
def max_name_and_count(df, column_name): 
    max_count = 0
    current_count = 0
    current_name = df.iloc[0][column_name]
    max_name = df.iloc[0][column_name]
    
    for value in range (len(df[column_name])):
        if (current_name == df.iloc[value][column_name]):
            current_count = current_count + 1
        else:
            current_name = df.iloc[value][column_name]
            current_count = 1
        if(current_count > max_count):
            max_name = df.iloc[value][column_name]
            max_count = current_count
    return [max_name, str(max_count)]

In [19]:
#Now lets show this function in ACTION

#I first needed to organize the dataframe by the city column
cars_df = cars_df.sort_values(by = ["City"], axis = 0)

#Using our function from earlier, max_values will be a list with the name of the top city and its vehicle count
max_values = max_name_and_count(cars_df, "City")

print("The city with the most electric vehicles registered in the state of Washington is " 
          + max_values[0] + " with a count of " + max_values[1])

The city with the most electric vehicles registered in the state of Washington is Seattle with a count of 22006


This function accepts a dataframe and a column name as its parameters, then sets up some temporary variables that keep track of the maximum count, the current count, the name associated with the maximum count, and the name associated with the current count. The df.iloc[value][column_name] expression picks out the row at position value and returns that row's entry in the column_name column. This is important because the function works by running a for loop that traverses the entire dataframe one row at a time. It first checks if current_name is equal to the name in the row we are on. If it is, then we add one to current_count. If not, then we reset current_count to 1 and update current_name to the new value. Then we check if current_count is greater than max_count. If it is, then we update max_name to current_name and max_count to current_count. Once the for loop has traversed the entire dataframe, the function returns a list with the name that has the highest count and that maximum count. Because the function only counts consecutive rows with the same name, it only works if the dataframe is sorted by the column that was passed in, so sorting is an important prerequisite.
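
As a cross-check on this logic, here is a sketch of two standard-library ways to get the same name-and-count pair from a plain list. The name city_list is a made-up toy list, not the vehicle data. itertools.groupby counts runs of equal neighbors, so like my function it needs sorted input, while collections.Counter does not.

```python
from collections import Counter
from itertools import groupby

# A made-up toy list, for illustration only (not the vehicle data).
city_list = ["Seattle", "Tacoma", "Seattle", "Bellevue", "Seattle", "Tacoma"]

# groupby only merges neighbors that are equal, so sort first.
runs = [(name, len(list(group))) for name, group in groupby(sorted(city_list))]
run_name, run_count = max(runs, key=lambda pair: pair[1])
print(run_name, run_count)  # Seattle 3

# Without sorting, every run here has length 1, which shows why sorting matters.
longest_unsorted_run = max(len(list(group)) for _, group in groupby(city_list))
print(longest_unsorted_run)  # 1

# Counter tallies values in any order, so no sorting is needed.
top_name, top_count = Counter(city_list).most_common(1)[0]
print(top_name, top_count)  # Seattle 3
```

Both approaches agree on the toy list, which is the same kind of check I do with value_counts() in the next part.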

# Recipe #3, Part 2
We can also use pandas value_counts() to get a similar result.

In [26]:
#To confirm that the earlier number is accurate, we can use value_counts()
#I could have used value_counts from the start, but to show my skills as a computer scientist and coder, 
#I wanted to find the value in multiple ways
print(cars_df["City"].value_counts())

#NOTE: value_counts returns a Series indexed by city, so getting the top city's name takes one more step
#(for example .idxmax()). My function returns the name and the count together, which helps in a minor way.

Seattle           22006
Bellevue           6489
Redmond            4646
Vancouver          4464
Kirkland           3923
                  ...  
Mascoutah             1
Seaside               1
Canoga Park           1
Maryhill              1
Smiths Station        1
Name: City, Length: 647, dtype: int64


I know that technically the function value_counts() returns a similar result. It gives back a Series of counts indexed by name, so getting the top name out for use in other code takes an extra step such as .idxmax(), while my function returns the name and the count together directly. Overall, I am proud that I was able to create this general function, which helped me greatly in my final project in this class.
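
For reference, value_counts() returns a pandas Series whose index holds the names, so the top name can be read off with idxmax(). Here is a minimal sketch on a toy Series; toy_cities is a made-up example, not the vehicle data.

```python
import pandas as pd

# A made-up toy Series, for illustration only (not the vehicle data).
toy_cities = pd.Series(["Seattle", "Tacoma", "Seattle", "Bellevue", "Seattle", "Tacoma"])

# value_counts() returns a Series: the index holds the names, the values hold the counts.
counts = toy_cities.value_counts()

# idxmax() returns the index label (the name) of the largest count.
top_city = counts.idxmax()
top_count = int(counts.max())

print(top_city, top_count)  # Seattle 3
```

Unlike max_name_and_count(), this does not need the data to be sorted first.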

# Recipe #4: Observable
For this final recipe, we need to move to my Observable notebook to show off a very cool data visualization that I created all on my own.