## Pandas
pandas is an open source Python library for data analysis. Python has always been great for prepping and munging data, but it's never been great for analysis - you'd usually end up using R or loading it into a database and using SQL (or worse, Excel). pandas makes Python great for analysis.

## Data Structures
pandas introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy (this means it's fast).

In [3]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

## Series
A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.

In [4]:
# create a Series with an arbitrary list
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])
s

0                7
1       Heisenberg
2             3.14
3      -1789710578
4    Happy Eating!
dtype: object

Alternatively, you can specify an index to use when creating the Series.

In [5]:
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])
s

A                7
Z       Heisenberg
C             3.14
Y      -1789710578
E    Happy Eating!
dtype: object

The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index.

In [6]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
s = pd.Series(d)
s

Chicago          1000.0
New York         1300.0
Portland          900.0
San Francisco    1100.0
Austin            450.0
Boston              NaN
dtype: float64

You can use the index to select specific items from the Series ...

In [7]:
# Get just Chicago
s["Chicago"]

1000.0

In [12]:
# Get Chicago, Portland & San Francisco
s[["Chicago", "Portland", "San Francisco"]]

Chicago          1000.0
Portland          900.0
San Francisco    1100.0
dtype: float64

Or you can use boolean indexing for selection.

In [16]:
# Just get cities with less than 1000
s[s < 1000]

Portland    900.0
Austin      450.0
dtype: float64

That last one might be a little weird, so let's make it more clear - s < 1000 returns a Series of True/False values, which we then pass to our Series s, returning only the items where the value is True.

In [18]:
less_than_1000 = s < 1000
print(less_than_1000)
print('\n')
print(s[less_than_1000])

Chicago          False
New York         False
Portland          True
San Francisco    False
Austin            True
Boston           False
dtype: bool


Portland    900.0
Austin      450.0
dtype: float64


You can also change the values in a Series on the fly.

In [20]:
# changing based on the index
print('Old value:', s['Chicago'])
s["Chicago"] = 400
print('New value:', s['Chicago'])

Old value: 1000.0
New value: 400.0


In [22]:
# changing values using boolean logic
print(s[s < 1000])
print('\n')
s[s < 1000] = 750
print(s[s < 1000])

Chicago     400.0
Portland    900.0
Austin      450.0
dtype: float64


Chicago     750.0
Portland    750.0
Austin      750.0
dtype: float64


What if you aren't sure whether an item is in the Series? You can check using idiomatic Python.

In [24]:
# Check if Seattle in the city list
print('Seattle' in s)
# Check if San Francisco in the city list
print('San Francisco' in s)

False
True
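One thing worth keeping in mind: the in operator on a Series checks the index labels, not the values. Here's a minimal sketch with a small made-up Series (the cities name and its numbers are just for illustration) showing the difference, with isin used to test the values instead.

```python
import pandas as pd

# Hypothetical city values, mirroring the Series used above
cities = pd.Series({'Chicago': 1000.0, 'Austin': 450.0})

label_found = 'Austin' in cities            # True: 'Austin' is an index label
value_as_label = 450.0 in cities            # False: 450.0 is a value, not a label
value_found = cities.isin([450.0]).any()    # True: isin looks at the values

print(label_found, value_as_label, value_found)
```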


Mathematical operations can be done using scalars and functions.

In [25]:
# divide city values by 3
s / 3

Chicago          250.000000
New York         433.333333
Portland         250.000000
San Francisco    366.666667
Austin           250.000000
Boston                  NaN
dtype: float64

In [26]:
# square city values
s ** 2

Chicago           562500.0
New York         1690000.0
Portland          562500.0
San Francisco    1210000.0
Austin            562500.0
Boston                 NaN
dtype: float64

NULL checking can be performed with isnull and notnull.

In [30]:
# use boolean logic to grab the NULL cities
print(s.isnull())
print('\n')
print(s[s.isnull()])

Chicago          False
New York         False
Portland         False
San Francisco    False
Austin           False
Boston            True
dtype: bool


Boston   NaN
dtype: float64


## DataFrame
A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can also think of a DataFrame as a group of Series objects that share an index (the column names).


## Reading Data
To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor.

Using the columns parameter allows us to tell the constructor how we'd like the columns ordered. In current versions of pandas, the DataFrame constructor keeps the dictionary's insertion order by default, as the output below shows (older versions sorted the columns alphabetically).

In [33]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)
#football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
football

   year     team  wins  losses
0  2010    Bears    11       5
1  2011    Bears     8       8
2  2012    Bears    10       6
3  2011  Packers    15       1
4  2012  Packers    11       5
5  2010    Lions     6      10
6  2011    Lions    10       6
7  2012    Lions     4      12
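To see the columns parameter in action, here's a small sketch that uses a cut-down version of the dictionary above (only two rows, chosen for illustration) and asks the constructor for a specific column order.

```python
import pandas as pd

# A shortened, illustrative version of the football data
data = {'year': [2010, 2011], 'team': ['Bears', 'Packers'], 'wins': [11, 15]}

# columns controls the order of the resulting columns
custom_order = pd.DataFrame(data, columns=['team', 'wins', 'year'])
print(custom_order)
```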


## CSV

Reading a CSV is as simple as calling the read_csv function. By default, the read_csv function expects the column separator to be a comma, but you can change that using the sep parameter.

In [37]:
from_csv = pd.read_csv('mariano-rivera.csv')
from_csv.head(3)

Unnamed: 0,Year,Age,Tm,Lg,W,L,W-L%,ERA,G,GS,GF,CG,SHO,SV,IP,H,R,ER,HR,BB,IBB,SO,HBP,BK,WP,BF,ERA+,WHIP,H/9,HR/9,BB/9,SO/9,SO/BB,Awards
0,1995,25,NYY,AL,5,3,0.625,5.51,19,10,2,0,0,0,67.0,71,43,41,11,30,0,51,2,1,0,301,84,1.507,9.5,1.5,4.0,6.9,1.7,
1,1996,26,NYY,AL,8,3,0.727,2.09,61,0,14,0,0,5,107.2,73,25,25,1,34,3,130,2,0,1,425,240,0.994,6.1,0.1,2.8,10.9,3.82,CYA-3MVP-12
2,1997,27,NYY,AL,6,4,0.6,1.88,66,0,56,0,0,43,71.2,65,17,15,5,20,6,68,0,0,2,301,239,1.186,8.2,0.6,2.5,8.5,3.4,ASMVP-25


Our file had headers, which the function inferred upon reading in the file. For a file without a header row, we could instead pass header=None to the function along with a list of column names to use (the names parameter).
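As a minimal sketch of both options, the snippet below reads a tiny pipe-delimited string with no header row. The text and the column names are made up for illustration, and io.StringIO stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text with no header row
raw = "1|24|M|technician\n2|53|F|other\n"

# Column names chosen for illustration
cols = ['user_id', 'age', 'gender', 'occupation']

# sep changes the delimiter; header=None plus names supplies the column labels
df = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=cols)
print(df)
```

Without header=None, read_csv would treat the first data row as the header.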

## Working with DataFrames
Now that we can get data into a DataFrame, we can finally start working with them. pandas has an abundance of functionality, far too much for me to cover in this introduction. I'd encourage anyone interested in diving deeper into the library to check out its excellent documentation. Or just use Google - there are a lot of Stack Overflow questions and blog posts covering specifics of the library.

We'll be using the MovieLens dataset in many examples going forward. The dataset contains 100,000 ratings made by 943 users on 1,682 movies.

In [73]:
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_csv('users.csv', header = None, names = u_cols)
users.head()



# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('movies.csv', header = None, names = m_cols, usecols = range(5))
movies['movie_id'] = pd.to_numeric(movies['movie_id'], errors = 'coerce').fillna(1)
movies.head()

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
# the keyword argument is names (plural)
ratings = pd.read_csv('rating.csv', header = None, names = r_cols)

## Inspection
pandas has a variety of functions for getting basic information about your DataFrame, the most basic of which is the info method.

In [49]:
movies.info()


The output tells us a few things about our DataFrame:

- It's an instance of a DataFrame.
- Each row was assigned an index of 0 to N-1, where N is the number of rows in the DataFrame. pandas will do this by default if an index is not specified. Don't worry, this can be changed later.
- There are 1,682 rows (every row must have an index).
- Our dataset has five total columns, one of which isn't populated at all (video_release_date) and two that are missing some values (release_date and imdb_url).
- A summary of the datatypes found in the DataFrame. Use the dtypes attribute to get the datatype of each column.
- An approximate amount of RAM used to hold the DataFrame. See the memory_usage method for a per-column breakdown.
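Here's a small sketch of the same inspection on a made-up three-row DataFrame (the column names and values are assumptions for illustration), so you can see where the non-null counts and memory figures come from.

```python
import pandas as pd

# Tiny illustrative frame with one missing title
df = pd.DataFrame({'title': ['A', 'B', None], 'rating': [3.5, 4.0, 2.0]})

df.info()                            # prints index range, non-null counts, dtypes
usage = df.memory_usage(deep=True)   # bytes per column, plus an 'Index' entry
non_null = df.notnull().sum()        # non-null count per column

print(usage)
print(non_null)
```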

In [None]:
movies.dtypes

DataFrames also have a describe method, which is great for seeing basic statistics about the dataset's numeric columns. Be careful though, since by default this returns information on every column of a numeric datatype, including ID columns where such statistics are meaningless.

In [None]:
users[['age']].describe()

We can quickly see the average age of our users is just above 34 years old, with the youngest being 7 and the oldest being 73. The median age is 31, with the youngest quartile of users being 25 or younger, and the oldest quartile being at least 43.

You've probably noticed that I've used the head method regularly throughout this post - by default, head displays the first five records of the dataset, while tail displays the last five.

In [None]:
movies.tail(10)

Use the unique method to get all the unique entries in a column.

In [50]:
users['occupation'].unique()

array(['technician', 'other', 'writer', 'executive', 'administrator',
       'student', 'lawyer', 'educator', 'scientist', 'entertainment',
       'programmer', 'librarian', 'homemaker', 'artist', 'engineer',
       'marketing', 'none', 'healthcare', 'retired', 'salesman', 'doctor'],
      dtype=object)

## Selecting
You can think of a DataFrame as a group of Series that share an index (in this case the column headers). This makes it easy to select specific columns.

Selecting a single column from the DataFrame will return a Series object.

In [56]:
users['occupation'].head()
#users[['occupation']].head()

0    technician
1         other
2        writer
3    technician
4         other
Name: occupation, dtype: object

To select multiple columns, simply pass a list of column names to the DataFrame, the output of which will be a DataFrame.

In [54]:
print(users[['age', 'zip_code']].head())
print('\n')

# can also store in a variable to use later
columns_you_want = ['occupation', 'gender']
users[columns_you_want].head()



   age zip_code
0   24    85711
1   53    94043
2   23    32067
3   24    43537
4   33    15213




Row selection can be done multiple ways, but selecting by an individual index or with boolean indexing is typically easiest.

In [64]:
# users older than 25
print(users[users['age'] > 25])
print('\n')

# users aged 40 AND male
print(users[(users['age'] == 40) & (users['gender'] == 'M')])
print('\n')

# users younger than 30 OR female
print(users[(users['age'] < 30) | (users['gender'] == 'F')])
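Here's a runnable sketch of these selection styles on a made-up four-row DataFrame (the people name, the labels, and the values are all assumptions for illustration). Note that each condition needs its own parentheses, and that & means AND while | means OR.

```python
import pandas as pd

# Small illustrative frame with string index labels
people = pd.DataFrame(
    {'age': [24, 40, 29, 40], 'gender': ['M', 'M', 'F', 'F']},
    index=['u1', 'u2', 'u3', 'u4'],
)

by_label = people.loc['u2']     # a single row by index label (returns a Series)
by_position = people.iloc[0]    # a single row by integer position

# aged 40 AND male
both = people[(people['age'] == 40) & (people['gender'] == 'M')]

# younger than 30 OR female
either = people[(people['age'] < 30) | (people['gender'] == 'F')]

print(both)
print(either)
```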

## Joining
Throughout an analysis, we'll often need to merge/join datasets as data is typically stored in a relational manner.

Our MovieLens data is a good example of this - a rating requires both a user and a movie, and the datasets are linked together by a key - in this case, the user_id and movie_id. It's possible for a user to be associated with zero or many ratings and movies. Likewise, a movie can be rated zero or many times, by a number of different users.

Like SQL's JOIN clause, pandas.merge allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (on, left_on, right_on, left_index, right_index) allowing you to specify the columns or indexes on which to join.

By default, pandas.merge operates as an inner join, which can be changed using the how parameter.

In [69]:
# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings, on = 'movie_id')
lens = pd.merge(movie_ratings, users, on = 'user_id')
lens.head()
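To make the how parameter concrete, here's a minimal sketch on two made-up frames (movies_demo and ratings_demo are hypothetical stand-ins, not the MovieLens files). Movie 3 has no ratings, so it disappears from the inner join but survives a left join.

```python
import pandas as pd

# Hypothetical stand-ins for the movies and ratings tables
movies_demo = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['A', 'B', 'C']})
ratings_demo = pd.DataFrame({'movie_id': [1, 1, 2], 'rating': [4, 5, 3]})

# Default is how='inner': only keys present in both frames are kept
inner = pd.merge(movies_demo, ratings_demo, on='movie_id')

# how='left' keeps every movie, filling missing ratings with NaN
left = pd.merge(movies_demo, ratings_demo, on='movie_id', how='left')

print(inner)
print(left)
```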

## Grouping
Grouping in pandas can take some time to grasp, but it's pretty awesome once it clicks.

pandas' groupby method draws largely from the split-apply-combine strategy for data analysis. If you're not familiar with this methodology, I highly suggest you read up on it. It does a great job of illustrating how to properly think through a data problem, which I feel is more important than any technical skill a data analyst/scientist can possess.

When approaching a data analysis problem, you'll often break it apart into manageable pieces, perform some operations on each of the pieces, and then put everything back together again (this is the gist of the split-apply-combine strategy). pandas' groupby is great for these problems (R users should check out the plyr and dplyr packages).

If you've ever used SQL's GROUP BY or an Excel Pivot Table, you've thought with this mindset, probably without realizing it.
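If the idea still feels abstract, here's a rough sketch that does the split, apply, and combine steps by hand on a made-up ratings table, then does the same thing with groupby. The data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Small illustrative ratings table
df = pd.DataFrame({'title': ['A', 'B', 'A', 'A'], 'rating': [4, 3, 5, 3]})

# Split: one sub-DataFrame per title
pieces = {t: df[df['title'] == t] for t in df['title'].unique()}

# Apply: compute the mean rating of each piece
means = {t: piece['rating'].mean() for t, piece in pieces.items()}

# Combine: put the results back into a single Series
manual = pd.Series(means).sort_index()

# groupby performs all three steps in one call
grouped = df.groupby('title')['rating'].mean()

print(manual)
print(grouped)
```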

We can use this to find the counts of reviews by each movie:

In [71]:
lens.groupby('title').size()


Now let's take this data and use it to find the 50 most reviewed movies. We're splitting the DataFrame into groups by movie title and applying the size method to get the count of records in each group. Next we need to order our results in descending order and limit the output to the top 50 using Python's slicing syntax.

In SQL, this would be equivalent to:

SELECT title, count(1)  
FROM lens  
GROUP BY title  
ORDER BY 2 DESC  
LIMIT 50;  


In [72]:
most_rated = lens.groupby('title').size().sort_values(ascending = False)[:50]
most_rated.head()

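Here's the same pattern on a tiny made-up table (lens_demo is a hypothetical stand-in for lens), limited to the top 2 instead of the top 50 so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical stand-in for the merged lens DataFrame: one row per rating
lens_demo = pd.DataFrame({'title': ['A', 'B', 'A', 'C', 'A', 'B']})

counts = lens_demo.groupby('title').size()           # GROUP BY title, count(1)
top_two = counts.sort_values(ascending=False)[:2]    # ORDER BY 2 DESC LIMIT 2

print(top_two)
```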

## Pivoting
### Which movies do men and women most disagree on?
Think about how you'd have to do this in SQL for a second. You'd have to use a combination of IF/CASE statements with aggregate functions in order to pivot your dataset. Your query would look something like this:

SELECT title, AVG(IF(gender = 'F', rating, NULL)), AVG(IF(gender = 'M', rating, NULL))
FROM lens
GROUP BY title;

Imagine how annoying it'd be if you had to do this on more than two columns.

DataFrames have a pivot_table method that makes these kinds of operations much easier (and less verbose).


In [None]:
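# movie_id is already a regular column in lens after the merge, so no
# reset_index call is needed before pivoting (reset_index('movie_id') would
# raise a KeyError because the index has no level with that name)
lens[['movie_id', 'title', 'gender', 'rating']].head()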

In [None]:
pivoted = lens.pivot_table(index=['movie_id', 'title'],
                           columns=['gender'],
                           values='rating',
                           fill_value= 0)
pivoted.head()

Next, calculate a difference column.

In [None]:
pivoted['diff'] = pivoted['M'] - pivoted['F']
pivoted.head()

Finally, limit to the top 50 movies and sort by our difference column.

In [None]:
pivoted.reset_index('movie_id', inplace=True)

In [None]:
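# most_50 is assumed here to be the 50 most-rated movies, indexed by movie_id
most_50 = lens.groupby('movie_id').size().sort_values(ascending=False)[:50]
disagreements = pivoted[pivoted.movie_id.isin(most_50.index)]['diff'].sort_values()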
disagreements
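Since the cells above depend on the MovieLens files, here's a self-contained sketch of the same pivot-then-diff idea on a made-up table (the demo data and its names are assumptions for illustration). pivot_table averages the values by default, which is what we want for ratings.

```python
import pandas as pd

# Illustrative ratings: two movies rated by men and women
demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['M', 'F', 'F', 'M', 'F'],
    'rating': [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, mean rating in each cell
pivoted_demo = demo.pivot_table(index='title', columns='gender', values='rating')

# Positive diff: men rated it higher; negative diff: women rated it higher
pivoted_demo['diff'] = pivoted_demo['M'] - pivoted_demo['F']
ordered = pivoted_demo['diff'].sort_values()

print(pivoted_demo)
print(ordered)
```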

# Homework
## Undergrads & Grads

Provide the code to read in the global COVID-19 cases, death and recovery time-series data into three separate DataFrames.

In [103]:
import pandas as pd

# Read the data into DataFrames
cases_df = pd.read_csv('time_series_covid19_confirmed_global_long.csv')
deaths_df = pd.read_csv('time_series_covid19_deaths_global_long.csv')
recovery_df = pd.read_csv('time_series_covid19_recovered_global_long.csv')


# Use display() to show all three DataFrames' heads at once
from IPython.display import display

display(cases_df.head(3))
display(deaths_df.head(3))
display(recovery_df.head(3))

   Province/State  Country/Region       Lat       Long       Date  Confirmed
0             NaN     Afghanistan  33.93911  67.709953  1/22/2020          0
1             NaN         Albania   41.1533    20.1683  1/22/2020          0
2             NaN         Algeria   28.0339     1.6596  1/22/2020          0


   Province/State  Country/Region       Lat       Long       Date  Deaths
0             NaN     Afghanistan  33.93911  67.709953  1/22/2020       0
1             NaN         Albania   41.1533    20.1683  1/22/2020       0
2             NaN         Algeria   28.0339     1.6596  1/22/2020       0


   Province/State  Country/Region       Lat       Long       Date  Recovered
0             NaN     Afghanistan  33.93911  67.709953  1/22/2020          0
1             NaN         Albania   41.1533    20.1683  1/22/2020          0
2             NaN         Algeria   28.0339     1.6596  1/22/2020          0


How many rows are present in each DataFrame?

In [104]:
# Calculate the number of rows in each DataFrame using len()
num_rows_cases = len(cases_df)
num_rows_deaths = len(deaths_df)
num_rows_recovery = len(recovery_df)

# Print the results
print("Number of rows in cases_df:", num_rows_cases)
print("Number of rows in deaths_df:", num_rows_deaths)
print("Number of rows in recovery_df:", num_rows_recovery)


Number of rows in cases_df: 171585
Number of rows in deaths_df: 171585
Number of rows in recovery_df: 162360


Print the names of all the columns in the Cases DataFrame

In [105]:
# Print the names of all columns in the Cases DataFrame
print("Columns in cases_df:")
for column in cases_df.columns:
    print(column)

Columns in cases_df:
Province/State
Country/Region
Lat
Long
Date
Confirmed


Show the last 10 rows of the confirmed cases DataFrame

In [106]:
# Show the last 10 rows of the confirmed cases DataFrame
last_10_rows_cases = cases_df.tail(10)
print(last_10_rows_cases)

       Province/State      Country/Region        Lat        Long       Date  \
171575            NaN      United Kingdom  55.378100   -3.436000  9/27/2021   
171576            NaN             Uruguay -32.522800  -55.765800  9/27/2021   
171577            NaN          Uzbekistan  41.377491   64.585262  9/27/2021   
171578            NaN             Vanuatu -15.376700  166.959200  9/27/2021   
171579            NaN           Venezuela   6.423800  -66.589700  9/27/2021   
171580            NaN             Vietnam  14.058324  108.277199  9/27/2021   
171581            NaN  West Bank and Gaza  31.952200   35.233200  9/27/2021   
171582            NaN               Yemen  15.552727   48.516388  9/27/2021   
171583            NaN              Zambia -13.133897   27.849332  9/27/2021   
171584            NaN            Zimbabwe -19.015438   29.154857  9/27/2021   

        Confirmed  
171575    7701715  
171576     388572  
171577     172493  
171578          4  
171579     363300  
171580    

Print a list of unique countries/regions in the deaths data.

In [107]:
# Get a list of unique countries/regions in the deaths data
unique_countries_deaths = deaths_df['Country/Region'].unique()

# Print the list of unique countries/regions
print("Unique Countries/Regions in Deaths Data:")
for country in unique_countries_deaths:
    print(country)

Unique Countries/Regions in Deaths Data:
Afghanistan
Albania
Algeria
Andorra
Angola
Antigua and Barbuda
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Brunei
Bulgaria
Burkina Faso
Burma
Burundi
Cabo Verde
Cambodia
Cameroon
Canada
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo (Brazzaville)
Congo (Kinshasa)
Costa Rica
Cote d'Ivoire
Croatia
Cuba
Cyprus
Czechia
Denmark
Diamond Princess
Djibouti
Dominica
Dominican Republic
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Eswatini
Ethiopia
Fiji
Finland
France
Gabon
Gambia
Georgia
Germany
Ghana
Greece
Grenada
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Holy See
Honduras
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Korea, South
Kosovo
Kuwait
Kyrgyzstan
Laos
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
MS Za

Merge confirmed cases, deaths and recoveries into one DataFrame (Hint: you will need to merge on these 5 columns: ['Province/State', 'Country/Region', 'Date', 'Lat', 'Long'])

In [108]:
# Merge cases_df and deaths_df based on the specified columns
merged_df = pd.merge(cases_df, deaths_df, on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long'])

# Merge the resulting DataFrame with recovery_df
merged_df = pd.merge(merged_df, recovery_df, on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long'])

# Display the first few rows of the merged DataFrame
display(merged_df.head())

display(merged_df.tail())

print('\n')
num_rows_merged = len(merged_df)
display("Number of rows in merged_df:", num_rows_merged)

print('\n')
print("Columns in merged_df:")
for merge_column in merged_df.columns:
    display(merge_column)

   Province/State  Country/Region       Lat       Long       Date  Confirmed  Deaths  Recovered
0             NaN     Afghanistan  33.93911  67.709953  1/22/2020          0       0          0
1             NaN         Albania   41.1533    20.1683  1/22/2020          0       0          0
2             NaN         Algeria   28.0339     1.6596  1/22/2020          0       0          0
3             NaN         Andorra   42.5063     1.5218  1/22/2020          0       0          0
4             NaN          Angola  -11.2027    17.8739  1/22/2020          0       0          0


        Province/State      Country/Region         Lat        Long       Date  Confirmed  Deaths  Recovered
158665             NaN             Vietnam   14.058324  108.277199  9/27/2021     766051   18758          0
158666             NaN  West Bank and Gaza     31.9522     35.2332  9/27/2021     398946    4046          0
158667             NaN               Yemen   15.552727   48.516388  9/27/2021       8988    1703          0
158668             NaN              Zambia  -13.133897   27.849332  9/27/2021     208867    3647          0
158669             NaN            Zimbabwe  -19.015438   29.154857  9/27/2021     129919    4607          0






'Number of rows in merged_df:'

158670



Columns in merged_df:


'Province/State'

'Country/Region'

'Lat'

'Long'

'Date'

'Confirmed'

'Deaths'

'Recovered'
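The merged DataFrame has 158,670 rows, fewer than the 171,585 rows in cases_df. A likely reason is that pd.merge defaults to an inner join, so any key combination missing from one of the inputs is dropped. Here's a hedged sketch of one way to check, using made-up frames with fewer key columns than the real data: an outer join with indicator=True labels where each row came from.

```python
import pandas as pd

keys = ['Country/Region', 'Date']

# Illustrative frames: country Y has no recovery record
cases = pd.DataFrame({'Country/Region': ['X', 'Y'],
                      'Date': ['1/22/2020'] * 2,
                      'Confirmed': [1, 2]})
recovered = pd.DataFrame({'Country/Region': ['X'],
                          'Date': ['1/22/2020'],
                          'Recovered': [0]})

inner = pd.merge(cases, recovered, on=keys)   # default inner join drops Y

# The _merge column reports 'both', 'left_only', or 'right_only' per row
audit = pd.merge(cases, recovered, on=keys, how='outer', indicator=True)
unmatched = audit[audit['_merge'] != 'both']

print(len(inner), len(audit))
print(unmatched)
```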

Create a new DataFrame from this merged data for just the latest date (Date equal to '9/27/2021')

In [109]:
# Filter the merged DataFrame for the latest date
latest_date = '9/27/2021'
latest_data_df = merged_df[merged_df['Date'] == latest_date]

# Print the first few rows of the new DataFrame
print(latest_data_df.head())

       Province/State Country/Region       Lat       Long       Date  \
158412            NaN    Afghanistan  33.93911  67.709953  9/27/2021   
158413            NaN        Albania  41.15330  20.168300  9/27/2021   
158414            NaN        Algeria  28.03390   1.659600  9/27/2021   
158415            NaN        Andorra  42.50630   1.521800  9/27/2021   
158416            NaN         Angola -11.20270  17.873900  9/27/2021   

        Confirmed  Deaths  Recovered  
158412     155072    7200          0  
158413     168188    2653          0  
158414     202877    5786          0  
158415      15189     130          0  
158416      55583    1513          0  


Print the top 5 Countries by deaths on the latest date (Hint: look up sort_values()). Display just the "Province/State", "Country/Region", and "Deaths" columns.

In [110]:
# Sort the DataFrame by 'Deaths' column in descending order
sorted_df = latest_data_df.sort_values(by='Deaths', ascending=False)

# Display only the specified columns: 'Province/State', 'Country/Region', and 'Deaths'
top_5_countries_deaths = sorted_df[['Province/State', 'Country/Region', 'Deaths']].head(5)

# Print the top 5 countries by deaths on the latest date
print(top_5_countries_deaths)

       Province/State Country/Region  Deaths
158645            NaN             US  690426
158442            NaN         Brazil  594653
158541            NaN          India  447373
158577            NaN         Mexico  275676
158610            NaN         Russia  201015


## Just Grads

Make a new DataFrame of the countries with the top 25 confirmed cases on the latest date.

In [111]:
# Sort the DataFrame by 'Confirmed' column in descending order
sorted_df_2 = latest_data_df.sort_values(by='Confirmed', ascending=False)

# Select the top 25 countries by confirmed cases
top_25_countries_confirmed = sorted_df_2.head(25)

# Print the new DataFrame with the top 25 countries by confirmed cases
print(top_25_countries_confirmed)

       Province/State  Country/Region        Lat        Long       Date  \
158645            NaN              US  40.000000 -100.000000  9/27/2021   
158541            NaN           India  20.593684   78.962880  9/27/2021   
158442            NaN          Brazil -14.235000  -51.925300  9/27/2021   
158660            NaN  United Kingdom  55.378100   -3.436000  9/27/2021   
158610            NaN          Russia  61.524010  105.318756  9/27/2021   
158644            NaN          Turkey  38.963700   35.243300  9/27/2021   
158524            NaN          France  46.227600    2.213700  9/27/2021   
158543            NaN            Iran  32.427908   53.688046  9/27/2021   
158418            NaN       Argentina -38.416100  -63.616700  9/27/2021   
158486            NaN        Colombia   4.570900  -74.297300  9/27/2021   
158630            NaN           Spain  40.463667   -3.749220  9/27/2021   
158547            NaN           Italy  41.871940   12.567380  9/27/2021   
158542            NaN    

Find the top 10 countries with the highest overall fatality rate of confirmed cases (total deaths divided by total cases)

In [112]:
# Group the data by 'Country/Region' and aggregate total cases and total deaths
grouped_df = merged_df.groupby('Country/Region').agg({'Confirmed': 'sum', 'Deaths': 'sum'})

# Calculate the fatality rate for each country
grouped_df['Fatality Rate'] = grouped_df['Deaths'] / grouped_df['Confirmed']

# Sort the DataFrame by the fatality rate in descending order
sorted_df_3 = grouped_df.sort_values(by='Fatality Rate', ascending=False)

# Select the top 10 countries with the highest fatality rates
top_10_countries_fatality = sorted_df_3.head(10)

# Print the new DataFrame with the top 10 countries by fatality rate
print(top_10_countries_fatality)

                Confirmed    Deaths  Fatality Rate
Country/Region                                    
MS Zaandam           4913      1090       0.221860
Yemen             1854834    409887       0.220983
Vanuatu               890       160       0.179775
Peru            614610886  58915483       0.095858
Mexico          813028025  72686321       0.089402
Sudan            12121592    825694       0.068118
Ecuador         132542843   7868038       0.059362
Egypt            84506766   4809672       0.056915
China            54999153   2704920       0.049181
Somalia           4165835    189802       0.045562
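One caveat with the approach above: the Confirmed and Deaths columns look cumulative (the values on the latest date are running totals), so summing over every date counts the same cases many times. A minimal sketch of an alternative, assuming the data is cumulative and using made-up numbers, takes only the latest date and sums across provinces within each country.

```python
import pandas as pd

# Illustrative cumulative data for two countries over two days
demo = pd.DataFrame({
    'Country/Region': ['X', 'X', 'Y', 'Y'],
    'Date': ['9/26/2021', '9/27/2021', '9/26/2021', '9/27/2021'],
    'Confirmed': [90, 100, 40, 50],
    'Deaths': [9, 10, 1, 1],
})

# Keep only the latest date, where the cumulative totals are final
latest = demo[demo['Date'] == '9/27/2021']

# Sum within each country (this would add up provinces in the real data)
totals = latest.groupby('Country/Region')[['Confirmed', 'Deaths']].sum()
totals['Fatality Rate'] = totals['Deaths'] / totals['Confirmed']

ranked = totals.sort_values('Fatality Rate', ascending=False)
print(ranked)
```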


Of the top 25 countries on the latest data from above, calculate the difference in their monthly total confirmed cases for September and August 2021 and sort by this difference.

In [124]:
# Check the data type of the 'Date' column
date_column_type = merged_df['Date'].dtype

# Print the data type
display("Data type of 'Date' column:", date_column_type)
# dtype('O') is the generic object dtype, which here holds the dates as strings

# Convert the 'Date' column to a datetime data type
merged_df['Date'] = pd.to_datetime(merged_df['Date'])

# Filter the data for August and September 2021
start_date = '2021-08-01'
end_date = '2021-09-30'

date_range_data = merged_df[(merged_df['Date'] >= start_date) & (merged_df['Date'] <= end_date)]

# Group the data by 'Country/Region' and aggregate total confirmed cases for each month
august_grouped = date_range_data[date_range_data['Date'].dt.month == 8].groupby('Country/Region')['Confirmed'].sum()
september_grouped = date_range_data[date_range_data['Date'].dt.month == 9].groupby('Country/Region')['Confirmed'].sum()

# Calculate the difference in total confirmed cases between September and August
difference_df = september_grouped - august_grouped

# Sort the DataFrame by the difference in descending order
sorted_difference_df = difference_df.sort_values(ascending=False)

# Select the top 25 countries and their differences
top_25_difference = sorted_difference_df.head(25)

# Print the top 25 countries by the difference in confirmed cases
print(top_25_difference)

"Data type of 'Date' column:"

dtype('O')

Country/Region
Malaysia          9778615
Thailand          9376404
Vietnam           7880611
Japan             7332155
Philippines       6258777
Iran              5333146
Cuba              4173962
Israel            2337011
Sri Lanka         1812842
Mongolia          1275036
Kazakhstan        1141131
Morocco            960007
Guatemala          943964
Australia          821771
Azerbaijan         758785
Burma              753729
United Kingdom     734763
Georgia            694815
Korea, South       487501
Kosovo             448651
Botswana           249948
Norway             240010
Jamaica            234747
Benin              223910
Mauritius          160641
Name: Confirmed, dtype: int64
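Two things to watch in the cell above: it ranks every country by the difference rather than restricting to the top 25 from the earlier question, and it sums cumulative daily values. Here's a hedged sketch of another reading, assuming the data is cumulative: filter to a chosen country list with isin, take the last cumulative value in each month, and difference those to get new cases per month. The data, the top_countries list, and the month labels are made up for illustration.

```python
import pandas as pd

# Illustrative cumulative counts at three dates for two countries
demo = pd.DataFrame({
    'Country/Region': ['X'] * 3 + ['Y'] * 3,
    'Date': pd.to_datetime(['7/31/2021', '8/31/2021', '9/27/2021'] * 2),
    'Confirmed': [100, 150, 170, 10, 40, 100],
})

top_countries = ['X', 'Y']   # stand-in for the top 25 list from earlier
subset = demo[demo['Country/Region'].isin(top_countries)].copy()
subset['month'] = subset['Date'].dt.strftime('%Y-%m')

# Last cumulative value in each month, one column per month
month_end = (subset.sort_values('Date')
                   .groupby(['Country/Region', 'month'])['Confirmed']
                   .last()
                   .unstack())

# New cases per month = difference between consecutive month-end totals
new_cases = month_end.diff(axis=1)

# September new cases minus August new cases, largest increase first
change = (new_cases['2021-09'] - new_cases['2021-08']).sort_values(ascending=False)
print(change)
```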


