<a href="https://colab.research.google.com/github/ShaunakSen/Data-Science-and-Machine-Learning/blob/master/Pandas_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Common Pandas Tasks and Applications for Interviews

> https://www.stratascratch.com/blog/python-pandas-interview-questions-for-data-science/

---




In [None]:
import pandas as pd
import numpy as np
from google.colab import data_table
data_table.enable_dataframe_formatter()

### Basic operations

In [None]:
covid_df = pd.read_csv('https://data.covid19india.org/csv/latest/districts.csv')

In [None]:
data_table.enable_dataframe_formatter()
covid_df.head()

Unnamed: 0,Date,State,District,Confirmed,Recovered,Deceased,Other,Tested
0,2020-04-26,Andaman and Nicobar Islands,Unknown,33,11,0,0,
1,2020-04-26,Andhra Pradesh,Anantapur,53,14,4,0,
2,2020-04-26,Andhra Pradesh,Chittoor,73,13,0,0,
3,2020-04-26,Andhra Pradesh,East Godavari,39,12,0,0,
4,2020-04-26,Andhra Pradesh,Guntur,214,29,8,0,


In [None]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358633 entries, 0 to 358632
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date       358633 non-null  object 
 1   State      358633 non-null  object 
 2   District   358633 non-null  object 
 3   Confirmed  358633 non-null  int64  
 4   Recovered  358633 non-null  int64  
 5   Deceased   358633 non-null  int64  
 6   Other      358633 non-null  int64  
 7   Tested     270698 non-null  float64
dtypes: float64(1), int64(4), object(3)
memory usage: 21.9+ MB


> Only the `Tested` column has missing values.

DataFrames differ from a SQL table or an Excel sheet in the flexibility of their row identifiers: the index. At present, the index contains sequential values, just as in a SQL table or a spreadsheet.

However, we can create our own row labels, and they __need not be unique or sequential__. For example, we can set the State column as the index using the set_index() method.



In [None]:
state_idx = covid_df.set_index('State')

state_idx.head()

Unnamed: 0_level_0,Date,District,Confirmed,Recovered,Deceased,Other,Tested
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Andaman and Nicobar Islands,2020-04-26,Unknown,33,11,0,0,
Andhra Pradesh,2020-04-26,Anantapur,53,14,4,0,
Andhra Pradesh,2020-04-26,Chittoor,73,13,0,0,
Andhra Pradesh,2020-04-26,East Godavari,39,12,0,0,
Andhra Pradesh,2020-04-26,Guntur,214,29,8,0,


The State field is no longer treated as a column. When we call the info() method on the new DataFrame, we can see that the State column from our original dataset is now the index.



In [None]:
state_idx.info()

<class 'pandas.core.frame.DataFrame'>
Index: 358633 entries, Andaman and Nicobar Islands to West Bengal
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date       358633 non-null  object 
 1   District   358633 non-null  object 
 2   Confirmed  358633 non-null  int64  
 3   Recovered  358633 non-null  int64  
 4   Deceased   358633 non-null  int64  
 5   Other      358633 non-null  int64  
 6   Tested     270698 non-null  float64
dtypes: float64(1), int64(4), object(2)
memory usage: 21.9+ MB


We can get back a sequential Index by calling the reset_index() method.
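
As a small illustration of this round trip, the sketch below uses a made-up three-row frame (the `toy` name and its values are assumptions, not part of the Covid dataset). It sets a non-unique index, selects by label, and then restores the sequential index.

```python
import pandas as pd

# Hypothetical toy data standing in for the much larger covid_df
toy = pd.DataFrame({
    "State": ["Goa", "Kerala", "Kerala"],
    "District": ["Unknown", "Kollam", "Wayanad"],
    "Confirmed": [7, 3, 5],
})

by_state = toy.set_index("State")

# A non-unique index label selects every matching row
kerala_rows = by_state.loc["Kerala"]

# reset_index() moves State back into the columns and restores 0..n-1
restored = by_state.reset_index()
```

Note that `set_index()` and `reset_index()` both return new objects here; `toy` itself is left unchanged.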

Each column of data in a Pandas DataFrame is a Pandas Series. Every time we access a Series, we get its index along with it. For example, if we take the District Series, we get a column with the State field as the index.



In [None]:
state_idx['District'].head()

State
Andaman and Nicobar Islands          Unknown
Andhra Pradesh                     Anantapur
Andhra Pradesh                      Chittoor
Andhra Pradesh                 East Godavari
Andhra Pradesh                        Guntur
Name: District, dtype: object

To see how many rows each state contributes, we can call the value_counts() method on the State Series.



In [None]:
pd.DataFrame(covid_df['State'].value_counts())

Unnamed: 0,State
Uttar Pradesh,41388
Madhya Pradesh,28597
Tamil Nadu,22053
Bihar,20951
Rajasthan,20408
Maharashtra,19907
Gujarat,18792
Karnataka,16972
Odisha,16765
Chhattisgarh,15556


Note: value_counts() ignores missing values in the frequency count. To include them, set the dropna argument to False. In this case there were no missing values, so we omitted this argument.
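
To make the effect of `dropna` concrete, here is a minimal sketch on a hypothetical Series with one missing value (the `states` Series is invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
states = pd.Series(["Goa", "Kerala", "Kerala", np.nan])

# Default behaviour: the missing value is left out of the counts
counts_default = states.value_counts()

# dropna=False: the missing value gets its own row
counts_with_nan = states.value_counts(dropna=False)
```

The two results differ only by the extra row for the missing value, so their totals are 3 and 4 respectively.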

### Sorting

As with SQL and spreadsheets, we can sort the table by a column or a sequence of columns. By default the values are sorted in ascending order; we can change this with the ascending parameter. For example, we can sort our Covid dataset in descending order of state name but ascending order of district name.



In [None]:
covid_df.sort_values(by=['State', 'District'], ascending = [False, True])



Unnamed: 0,Date,State,District,Confirmed,Recovered,Deceased,Other,Tested
17447,2020-05-29,West Bengal,Alipurduar,4,0,0,0,
18053,2020-05-30,West Bengal,Alipurduar,4,0,0,0,
18661,2020-05-31,West Bengal,Alipurduar,4,0,0,0,
19293,2020-06-01,West Bengal,Alipurduar,5,0,0,0,
19925,2020-06-02,West Bengal,Alipurduar,5,0,0,0,
...,...,...,...,...,...,...,...,...
355336,2021-10-27,Andaman and Nicobar Islands,Unknown,7650,7516,129,0,
355996,2021-10-28,Andaman and Nicobar Islands,Unknown,7651,7516,129,0,
356656,2021-10-29,Andaman and Nicobar Islands,Unknown,7651,7517,129,0,
357315,2021-10-30,Andaman and Nicobar Islands,Unknown,7651,7518,129,0,
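
The same mixed-direction sort can be checked on a frame small enough to read at a glance. The `toy` data below is hypothetical; only `sort_values()` and its `by` and `ascending` parameters come from the text above.

```python
import pandas as pd

# Hypothetical four-row frame
toy = pd.DataFrame({
    "State": ["Goa", "Kerala", "Kerala", "Goa"],
    "District": ["South Goa", "Wayanad", "Kollam", "North Goa"],
})

# State descending, then District ascending within each state
ordered = toy.sort_values(by=["State", "District"], ascending=[False, True])
```

As in the Covid output, the sorted rows keep their original index labels, which is why the index no longer runs from 0 upward.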


### Duplicates


Dealing with duplicates is a very common problem in Data Science interview questions. The presence of duplicates need not mean incorrect data. For example, a customer might purchase multiple items, so the transaction data might contain repeated values of the same card number or customer id. Pandas provides a convenient way of dropping duplicates and creating a unique set of records: the `drop_duplicates()` method.

Suppose we want to find the list of distinct states in our Covid dataset. We could pass the entire Series to a Python set and then convert it back into a Pandas Series. Alternatively, we can apply the `drop_duplicates()` method. We can either apply it to the Pandas Series,





In [None]:
covid_df['State'].drop_duplicates()

0                      Andaman and Nicobar Islands
1                                   Andhra Pradesh
13                               Arunachal Pradesh
14                                           Assam
15                                           Bihar
37                                      Chandigarh
38                                    Chhattisgarh
43                                           Delhi
44                                             Goa
45                                         Gujarat
75                                Himachal Pradesh
81                                         Haryana
101                                      Jharkhand
111                              Jammu and Kashmir
128                                      Karnataka
149                                         Kerala
163                                         Ladakh
165                                    Maharashtra
198                                      Meghalaya
199                            

Or we can apply it to the entire dataset and specify a subset of columns:

In [None]:
covid_df.drop_duplicates(subset = ['State'])


Unnamed: 0,Date,State,District,Confirmed,Recovered,Deceased,Other,Tested
0,2020-04-26,Andaman and Nicobar Islands,Unknown,33,11,0,0,
1,2020-04-26,Andhra Pradesh,Anantapur,53,14,4,0,
13,2020-04-26,Arunachal Pradesh,Lohit,1,1,0,0,
14,2020-04-26,Assam,Unknown,36,27,1,0,
15,2020-04-26,Bihar,Arwal,4,0,0,0,
37,2020-04-26,Chandigarh,Chandigarh,36,17,0,0,756.0
38,2020-04-26,Chhattisgarh,Bilaspur,1,1,0,0,
43,2020-04-26,Delhi,Delhi,2918,877,54,0,37613.0
44,2020-04-26,Goa,Unknown,7,7,0,0,
45,2020-04-26,Gujarat,Ahmedabad,2181,140,104,0,
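
The approaches mentioned above can be compared side by side on a hypothetical Series (the `states` values are made up). `unique()` is included as a third option; it is a standard Series method, though it is not the one this section focuses on.

```python
import pandas as pd

# Hypothetical Series with repeated values
states = pd.Series(["Goa", "Kerala", "Goa", "Assam", "Kerala"])

# Python set round trip: distinct values, but the order is not guaranteed
via_set = pd.Series(list(set(states)))

# drop_duplicates(): keeps the first occurrence of each value and its index label
via_drop = states.drop_duplicates()

# unique(): array of distinct values in order of first appearance
via_unique = states.unique()
```

A reasonable rule of thumb is to prefer `drop_duplicates()` when the original row labels matter, since the set round trip discards both order and index.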


Let us apply this to a Python Pandas interview question that came up in an Airbnb Data Science interview.

### Python Pandas Interview Questions


> https://platform.stratascratch.com/coding/9626-find-all-neighborhoods-present-in-this-dataset?python=1

### Slicing Pandas Dataset


You can also subset specific rows based on their values, like the WHERE condition in SQL or Filter option in Spreadsheets. Suppose we want only data from the states of Goa and Maharashtra. We can simply pass this as a condition and slice the DataFrame as one would slice a string or a list.




In [None]:
(covid_df['State'] == 'Goa') | (covid_df['State'] == 'Maharashtra')

0         False
1         False
2         False
3         False
4         False
          ...  
358628    False
358629    False
358630    False
358631    False
358632    False
Name: State, Length: 358633, dtype: bool

When we pass this into the slicer, only the rows with True are retained. This is a very efficient and powerful way of filtering data.

In [None]:
covid_df[(covid_df['State'] == 'Goa') | (covid_df['State'] == 'Maharashtra')]



Unnamed: 0,Date,State,District,Confirmed,Recovered,Deceased,Other,Tested
44,2020-04-26,Goa,Unknown,7,7,0,0,
165,2020-04-26,Maharashtra,Ahmednagar,36,22,2,0,
166,2020-04-26,Maharashtra,Akola,29,7,1,0,
167,2020-04-26,Maharashtra,Amravati,20,4,1,0,
168,2020-04-26,Maharashtra,Aurangabad,50,22,5,0,2431.0
...,...,...,...,...,...,...,...,...
358283,2021-10-31,Maharashtra,Solapur,210466,204364,5551,110,629266.0
358284,2021-10-31,Maharashtra,Thane,610128,597141,11462,35,1229625.0
358285,2021-10-31,Maharashtra,Wardha,57344,55956,1217,165,52365.0
358286,2021-10-31,Maharashtra,Washim,41663,41020,637,3,
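
When the condition is "value is one of several", `isin()` is a shorter way to build the same boolean mask. The sketch below uses a hypothetical `toy` frame to show that both forms select the same rows.

```python
import pandas as pd

# Hypothetical four-row frame
toy = pd.DataFrame({
    "State": ["Goa", "Kerala", "Maharashtra", "Goa"],
    "Confirmed": [7, 3, 36, 9],
})

# Explicit OR of two comparisons, as in the cell above
mask_or = (toy["State"] == "Goa") | (toy["State"] == "Maharashtra")

# Equivalent membership test
mask_isin = toy["State"].isin(["Goa", "Maharashtra"])

subset = toy[mask_isin]
```

The parentheses around each comparison in `mask_or` are required, because `|` binds more tightly than `==` in Python.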


### Question

> https://platform.stratascratch.com/coding/9995-top-10-ranked-songs?python=1

Soln:

```python
top_10 = spotify_worldwide_daily_song_ranking[spotify_worldwide_daily_song_ranking['position'] <= 10]
(
    top_10[['position', 'trackname']]
    .sort_values(by=['position', 'trackname'], ascending=[False, True])
    .drop_duplicates()
)
```



### Aggregations


Let’s say we wanted to find the average number of confirmed cases reported in each district. We can do this by passing the State and District combination as the grouper variables and then calculate the average by calling the mean() method.

In [None]:
data_table.disable_dataframe_formatter()
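# numeric_only=True skips text columns such as Date, which pandas 2.x would otherwise refuse to average
covid_df.groupby(['State', 'District']).mean(numeric_only=True)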

Unnamed: 0_level_0,Unnamed: 1_level_0,Confirmed,Recovered,Deceased,Other,Tested
State,District,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Andaman and Nicobar Islands,Unknown,4623.552347,4440.752708,68.074007,0.027076,
Andhra Pradesh,Anantapur,79606.351986,77166.389892,602.866426,0.000000,606774.047170
Andhra Pradesh,Chittoor,109870.144404,104801.902527,911.308664,0.000000,608569.737736
Andhra Pradesh,East Godavari,140476.093863,133979.236462,664.498195,0.000000,737638.858491
Andhra Pradesh,Foreign Evacuees,418.531429,399.714286,0.000000,0.000000,
...,...,...,...,...,...,...
West Bengal,Purba Medinipur,26824.191336,25649.314079,224.256318,0.000000,
West Bengal,Purulia,9545.137667,9163.288719,55.722753,0.000000,
West Bengal,South 24 Parganas,44986.563177,42874.063177,669.989170,0.000000,
West Bengal,Unknown,168.400000,52.333333,13.000000,0.000000,


As you can see from the result, the State and District combination now becomes the index. In Pandas this is called a MultiIndex. You can easily get the index values back into the DataFrame by calling the reset_index() method. Or you can prevent the creation of this index by setting the as_index argument to False in the groupby method.



In [None]:
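# numeric_only=True skips text columns such as Date, which pandas 2.x would otherwise refuse to average
covid_df.groupby(['State', 'District'], as_index = False).mean(numeric_only=True)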

Unnamed: 0,State,District,Confirmed,Recovered,Deceased,Other,Tested
0,Andaman and Nicobar Islands,Unknown,4623.552347,4440.752708,68.074007,0.027076,
1,Andhra Pradesh,Anantapur,79606.351986,77166.389892,602.866426,0.000000,606774.047170
2,Andhra Pradesh,Chittoor,109870.144404,104801.902527,911.308664,0.000000,608569.737736
3,Andhra Pradesh,East Godavari,140476.093863,133979.236462,664.498195,0.000000,737638.858491
4,Andhra Pradesh,Foreign Evacuees,418.531429,399.714286,0.000000,0.000000,
...,...,...,...,...,...,...,...
676,West Bengal,Purba Medinipur,26824.191336,25649.314079,224.256318,0.000000,
677,West Bengal,Purulia,9545.137667,9163.288719,55.722753,0.000000,
678,West Bengal,South 24 Parganas,44986.563177,42874.063177,669.989170,0.000000,
679,West Bengal,Unknown,168.400000,52.333333,13.000000,0.000000,


This gives us the grouper variables as columns in the DataFrame, so we can use them for further analysis without calling reset_index() again and again. You will also have noticed that Pandas calculates the mean of every numeric column. We can keep only the relevant columns by subsetting the final dataset, or we can simply call the agg() method on the grouped DataFrame.



In [None]:
covid_df.groupby(['State', 'District'], as_index = False).agg({'Confirmed' : 'mean'})

Unnamed: 0,State,District,Confirmed
0,Andaman and Nicobar Islands,Unknown,4623.552347
1,Andhra Pradesh,Anantapur,79606.351986
2,Andhra Pradesh,Chittoor,109870.144404
3,Andhra Pradesh,East Godavari,140476.093863
4,Andhra Pradesh,Foreign Evacuees,418.531429
...,...,...,...
676,West Bengal,Purba Medinipur,26824.191336
677,West Bengal,Purulia,9545.137667
678,West Bengal,South 24 Parganas,44986.563177
679,West Bengal,Unknown,168.400000


The agg() method also makes it easy to calculate several aggregates at once. Suppose we want to report not only the average number of confirmed cases but also the maximum number of deaths. We can pass both as a dictionary to the agg() method.

In [None]:
covid_df.groupby(['State', 'District'], as_index = False).agg({'Confirmed' : 'mean', 'Deceased' : 'max'})

Unnamed: 0,State,District,Confirmed,Deceased
0,Andaman and Nicobar Islands,Unknown,4623.552347,129
1,Andhra Pradesh,Anantapur,79606.351986,1093
2,Andhra Pradesh,Chittoor,109870.144404,1947
3,Andhra Pradesh,East Godavari,140476.093863,1290
4,Andhra Pradesh,Foreign Evacuees,418.531429,0
...,...,...,...,...
676,West Bengal,Purba Medinipur,26824.191336,397
677,West Bengal,Purulia,9545.137667,113
678,West Bengal,South 24 Parganas,44986.563177,1336
679,West Bengal,Unknown,168.400000,48
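
With the dictionary form, the output columns keep their source names, so `Confirmed` now holds a mean and `Deceased` a maximum. One alternative is named aggregation, which lets us choose the output names. The sketch below uses a hypothetical `toy` frame; the names `avg_confirmed` and `max_deceased` are chosen purely for illustration.

```python
import pandas as pd

# Hypothetical frame with two rows per district
toy = pd.DataFrame({
    "State": ["Goa", "Goa", "Kerala", "Kerala"],
    "District": ["Unknown", "Unknown", "Kollam", "Kollam"],
    "Confirmed": [10, 20, 4, 8],
    "Deceased": [0, 1, 0, 2],
})

# Each keyword is an output column: name=(source column, aggregate)
summary = toy.groupby(["State", "District"], as_index=False).agg(
    avg_confirmed=("Confirmed", "mean"),
    max_deceased=("Deceased", "max"),
)
```

This keeps the aggregation and the renaming in one step, instead of chaining a separate `rename()` call afterwards.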


### Merging

#### Concat and Append

> https://pythonprogramming.net/concatenate-append-data-analysis-python-pandas-tutorial/






In [None]:
df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])

df2 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2005, 2006, 2007, 2008])

df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

In [None]:
df1.head()

Unnamed: 0,HPI,Int_rate,US_GDP_Thousands
2001,80,2,50
2002,85,3,55
2003,88,2,65
2004,85,2,55


In [None]:
df2.head()

Unnamed: 0,HPI,Int_rate,US_GDP_Thousands
2005,80,2,50
2006,85,3,55
2007,88,2,65
2008,85,2,55


In [None]:
df3.head()

Unnamed: 0,HPI,Int_rate,Low_tier_HPI
2001,80,2,50
2002,85,3,52
2003,88,2,50
2004,85,2,53


Notice there are two major changes between these. df1 and df3 have the same index, but they have some different columns. df2 and df3 have different indexes and some differing columns. With concatenation, we can talk about various methods of bringing these together. Let's try a simple concatenation:



In [None]:
concat = pd.concat([df1,df2])
concat

Unnamed: 0,HPI,Int_rate,US_GDP_Thousands
2001,80,2,50
2002,85,3,55
2003,88,2,65
2004,85,2,55
2005,80,2,50
2006,85,3,55
2007,88,2,65
2008,85,2,55


Easy enough. The major difference between these was merely a continuation of the index, but they shared the same columns. Now they have become a single dataframe. In our case, however, we're curious about adding columns, not rows. What happens when we combine some shared and some new:



In [None]:
concat = pd.concat([df1,df2,df3])
concat

Unnamed: 0,HPI,Int_rate,US_GDP_Thousands,Low_tier_HPI
2001,80,2,50.0,
2002,85,3,55.0,
2003,88,2,65.0,
2004,85,2,55.0,
2005,80,2,50.0,
2006,85,3,55.0,
2007,88,2,65.0,
2008,85,2,55.0,
2001,80,2,,50.0
2002,85,3,,52.0


In [None]:
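# DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0;
# pd.concat([df1, df2]) is the equivalent call and works on every version
df4 = pd.concat([df1, df2])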
df4

Unnamed: 0,HPI,Int_rate,US_GDP_Thousands
2001,80,2,50
2002,85,3,55
2003,88,2,65
2004,85,2,55
2005,80,2,50
2006,85,3,55
2007,88,2,65
2008,85,2,55


That's what we expect with an append. In most cases, you are going to do something like this, as if you're inserting a new row in a database. Dataframes were not really made to be appended to efficiently; they are meant more to be manipulated based on their starting data, but you can append if you need to. What happens when we append data with the same index?



In [None]:
# equivalent to df1.append(df3) on pandas < 2.0; pd.concat works on every version
df4 = pd.concat([df1, df3])
df4

Unnamed: 0,HPI,Int_rate,US_GDP_Thousands,Low_tier_HPI
2001,80,2,50.0,
2002,85,3,55.0,
2003,88,2,65.0,
2004,85,2,55.0,
2001,80,2,,50.0
2002,85,3,,52.0
2003,88,2,,50.0
2004,85,2,,53.0
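
Since the stated goal was to add columns rather than rows, it may help to see `pd.concat` with `axis=1`, which aligns on the index and places the frames side by side. The frames below are trimmed, hypothetical versions of `df1` and `df3` with no overlapping columns.

```python
import pandas as pd

# Trimmed stand-ins for df1 and df3, sharing the same index
df1 = pd.DataFrame({"HPI": [80, 85, 88, 85],
                    "US_GDP_Thousands": [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({"Low_tier_HPI": [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])

# axis=1 aligns rows on the index and adds the columns of df3 next to df1
wide = pd.concat([df1, df3], axis=1)

# The default axis=0 stacks rows instead, repeating index labels and filling gaps with NaN
tall = pd.concat([df1, df3])
```

Here `wide` has four rows and no missing values, while `tall` has eight rows with the same NaN pattern seen in the output above.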


#### Join and Merge



In [None]:
df1

Unnamed: 0,HPI,Int_rate,US_GDP_Thousands
2001,80,2,50
2002,85,3,55
2003,88,2,65
2004,85,2,55


In [None]:
df3

Unnamed: 0,HPI,Int_rate,Low_tier_HPI
2001,80,2,50
2002,85,3,52
2003,88,2,50
2004,85,2,53


In [None]:
pd.merge(df1,df3, on='HPI')

Unnamed: 0,HPI,Int_rate_x,US_GDP_Thousands,Int_rate_y,Low_tier_HPI
0,80,2,50,2,50
1,85,3,55,3,52
2,85,3,55,2,53
3,85,2,55,3,52
4,85,2,55,2,53
5,88,2,65,2,50


In [None]:
pd.merge(df1,df2, on=['HPI','Int_rate'])

Unnamed: 0,HPI,Int_rate,US_GDP_Thousands_x,US_GDP_Thousands_y
0,80,2,50,50
1,85,3,55,55
2,88,2,65,65
3,85,2,55,55


Notice that there are two versions of US_GDP_Thousands. We didn't merge on this column, so both copies are retained, with `_x` and `_y` suffixes to tell them apart. Remember when I said that Pandas is a great module to pair with a database like MySQL? Here's why.

Generally, with databases, you want to keep them as lightweight as possible so the queries that run on them can execute as fast as possible.

Let's say you run a website like pythonprogramming.net, where you have users, so you definitely want to track username and encrypted password hashes, so that's 2 columns for sure. Maybe then you have a login name, a username, a password, an email and a join date. So that's already 5 columns with basic data points. Then maybe you have something like user settings, posts if you have a forum, completed tutorials. Then maybe you want to have settings like admin, moderator, regular user.

The lists can go on and on. If you have literally just 1 massive table, this can work, but it might also be better to distribute the table, since many operations will simply be much quicker and more efficient. After merging, you would probably set the new index. Something like this:



In [None]:
df4 = pd.merge(df1,df3, on='HPI')
df4.set_index('HPI', inplace=True)
df4

Unnamed: 0_level_0,Int_rate_x,US_GDP_Thousands,Int_rate_y,Low_tier_HPI
HPI,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
80,2,50,2,50
85,3,55,3,52
85,3,55,2,53
85,2,55,3,52
85,2,55,2,53
88,2,65,2,50


merge() joins on columns.

But what if the keys we want to join on are in the index? Then we use join().



In [None]:
# Move HPI into the index of both frames so that join() can match on it
df1.set_index('HPI', inplace=True)
df3.set_index('HPI', inplace=True)

In [None]:
df1

Unnamed: 0_level_0,Int_rate,US_GDP_Thousands
HPI,Unnamed: 1_level_1,Unnamed: 2_level_1
80,2,50
85,3,55
88,2,65
85,2,55


In [None]:
df3

Unnamed: 0_level_0,Int_rate,Low_tier_HPI
HPI,Unnamed: 1_level_1,Unnamed: 2_level_1
80,2,50
85,3,52
88,2,50
85,2,53


In [None]:
# Int_rate exists in both frames, so the suffixes tell the two copies apart
joined = df1.join(df3, on=['HPI'], lsuffix='1', rsuffix='2')
joined

Unnamed: 0_level_0,Int_rate1,US_GDP_Thousands,Int_rate2,Low_tier_HPI
HPI,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
80,2,50,2,50
85,3,55,3,52
85,3,55,2,53
88,2,65,2,50
85,2,55,3,52
85,2,55,2,53


Now, let's consider joining and merging on slightly differing indexes. Let's redefine df1 and df3, turning them into:



In [None]:
df1 = pd.DataFrame({
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55],
                    'Year':[2001, 2002, 2003, 2004]
                    })

df3 = pd.DataFrame({
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53],
                    'Year':[2001, 2003, 2004, 2005]})

In [None]:
df1

Unnamed: 0,Int_rate,US_GDP_Thousands,Year
0,2,50,2001
1,3,55,2002
2,2,65,2003
3,2,55,2004


In [None]:
df3

Unnamed: 0,Unemployment,Low_tier_HPI,Year
0,7,50,2001
1,8,52,2003
2,9,50,2004
3,6,53,2005


Here, we now have similar year columns, but different dates. df3 has 2005 but not 2002, and df1 is the reverse of that. Now, what happens when we merge?

In [None]:
merged = pd.merge(df1,df3, on='Year')
merged

Unnamed: 0,Int_rate,US_GDP_Thousands,Year,Unemployment,Low_tier_HPI
0,2,50,2001,7,50
1,2,65,2003,8,52
2,2,55,2004,9,50


Notice how 2005 and 2002 are just totally missing? By default, merge keeps only the keys that both frames share (an inner join). What can we do about this? It turns out there is a "how" parameter when merging, which mirrors the join types used when merging database tables. You have the following choices: left, right, outer, inner.

```
left  - SQL left outer join: use keys from the left frame only
right - SQL right outer join: use keys from the right frame only
outer - SQL full outer join: use the union of keys
inner - SQL inner join: use only the intersection of keys
```

In [None]:
merged = pd.merge(df1,df3, on='Year', how='left')
merged.set_index('Year', inplace=True)
merged

Unnamed: 0_level_0,Int_rate,US_GDP_Thousands,Unemployment,Low_tier_HPI
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2001,2,50,7.0,50.0
2002,3,55,,
2003,2,65,8.0,52.0
2004,2,55,9.0,50.0
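
To compare all four options at once, the sketch below counts the rows each `how` value produces on trimmed copies of `df1` and `df3`. It also uses `indicator=True`, which adds a `_merge` column recording whether each key came from the left frame, the right frame, or both. The `rows` and `outer` names are introduced here for illustration.

```python
import pandas as pd

# Trimmed copies of the frames above: Year is the only shared column
df1 = pd.DataFrame({"Int_rate": [2, 3, 2, 2],
                    "Year": [2001, 2002, 2003, 2004]})
df3 = pd.DataFrame({"Unemployment": [7, 8, 9, 6],
                    "Year": [2001, 2003, 2004, 2005]})

# Number of rows produced by each join type
rows = {how: len(pd.merge(df1, df3, on="Year", how=how))
        for how in ["inner", "left", "right", "outer"]}

# indicator=True adds a _merge column: left_only, right_only or both
outer = pd.merge(df1, df3, on="Year", how="outer", indicator=True)
```

With these frames, the inner join keeps the 3 shared years, the left and right joins keep 4 rows each, and the outer join keeps all 5 distinct years.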


#### Summarize

> https://stackoverflow.com/questions/15819050/pandas-dataframe-concat-vs-append (answer by Mohsin Mahmood)

- Concat gives the flexibility to combine along either axis (stacking rows or adding columns)

- Append is the specific case of concat with axis=0 and join='outer'

- Join is based on the indexes (set by set_index); its how parameter takes 'left', 'right', 'inner' or 'outer'

- Merge is based on a particular column in each of the two dataframes, specified through parameters such as 'left_on', 'right_on' and 'on'
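
The relationship between the last two points can be sketched with two small hypothetical frames: merging on a column gives the same data as joining after that column has been moved into the index.

```python
import pandas as pd

# Hypothetical frames sharing a Year key
left = pd.DataFrame({"Year": [2001, 2002], "Int_rate": [2, 3]})
right = pd.DataFrame({"Year": [2001, 2002], "Unemployment": [7, 8]})

# merge() matches on a column ...
merged = pd.merge(left, right, on="Year", how="left")

# ... while join() matches on the index, so Year is set as the index first
joined = left.set_index("Year").join(right.set_index("Year"), how="left")
```

The only difference in the results is where `Year` lives: it is a regular column in `merged` and the index in `joined`.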

## Practice Questions



### Naive Forecasting

> https://platform.stratascratch.com/coding/10313-naive-forecasting?code_type=1

In [1]:
import pandas as pd
from datetime import datetime

In [2]:
# create dummy data

In [69]:
# Sample data
data = {
    'request_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'request_date': [
        datetime(2023, 1, 15, 8, 30),
        datetime(2023, 2, 1, 9, 0),
        datetime(2023, 3, 10, 9, 30),
        datetime(2023, 4, 5, 10, 0),
        datetime(2023, 5, 20, 10, 30),
        datetime(2023, 6, 15, 11, 0),
        datetime(2023, 7, 1, 11, 30),
        datetime(2023, 8, 12, 12, 0),
        datetime(2023, 9, 23, 12, 30),
        datetime(2023, 10, 5, 13, 0)
    ],
    'request_status': ['completed', 'pending', 'cancelled', 'completed', 'pending', 'completed', 'pending', 'completed', 'cancelled', 'completed'],
    'distance_to_travel': [12.5, 7.0, 5.2, 20.3, 14.1, 10.0, 8.5, 15.0, 12.0, 18.0],
    'monetary_cost': [25.0, 15.0, 10.0, 40.0, 28.0, 20.0, 17.0, 30.0, 24.0, 36.0],
    'driver_to_client_distance': [1.2, 2.5, 0.8, 3.0, 1.0, 2.0, 1.5, 2.8, 1.1, 1.9]
}

# Creating DataFrame
df = pd.DataFrame(data)

In [70]:
df

Unnamed: 0,request_id,request_date,request_status,distance_to_travel,monetary_cost,driver_to_client_distance
0,1,2023-01-15 08:30:00,completed,12.5,25.0,1.2
1,2,2023-02-01 09:00:00,pending,7.0,15.0,2.5
2,3,2023-03-10 09:30:00,cancelled,5.2,10.0,0.8
3,4,2023-04-05 10:00:00,completed,20.3,40.0,3.0
4,5,2023-05-20 10:30:00,pending,14.1,28.0,1.0
5,6,2023-06-15 11:00:00,completed,10.0,20.0,2.0
6,7,2023-07-01 11:30:00,pending,8.5,17.0,1.5
7,8,2023-08-12 12:00:00,completed,15.0,30.0,2.8
8,9,2023-09-23 12:30:00,cancelled,12.0,24.0,1.1
9,10,2023-10-05 13:00:00,completed,18.0,36.0,1.9


In [71]:
df.dtypes

request_id                            int64
request_date                 datetime64[ns]
request_status                       object
distance_to_travel                  float64
monetary_cost                       float64
driver_to_client_distance           float64
dtype: object

In [72]:
df['request_month'] = df['request_date'].dt.month

In [73]:
df

Unnamed: 0,request_id,request_date,request_status,distance_to_travel,monetary_cost,driver_to_client_distance,request_month
0,1,2023-01-15 08:30:00,completed,12.5,25.0,1.2,1
1,2,2023-02-01 09:00:00,pending,7.0,15.0,2.5,2
2,3,2023-03-10 09:30:00,cancelled,5.2,10.0,0.8,3
3,4,2023-04-05 10:00:00,completed,20.3,40.0,3.0,4
4,5,2023-05-20 10:30:00,pending,14.1,28.0,1.0,5
5,6,2023-06-15 11:00:00,completed,10.0,20.0,2.0,6
6,7,2023-07-01 11:30:00,pending,8.5,17.0,1.5,7
7,8,2023-08-12 12:00:00,completed,15.0,30.0,2.8,8
8,9,2023-09-23 12:30:00,cancelled,12.0,24.0,1.1,9
9,10,2023-10-05 13:00:00,completed,18.0,36.0,1.9,10


In [74]:
def compute_distance_per_dollar(group):
    # total distance over total cost, so months with several requests are handled correctly
    return group['distance_to_travel'].sum() / group['monetary_cost'].sum()

In [75]:
df_distance_per_dollar = df.groupby(by="request_month").apply(compute_distance_per_dollar).reset_index().rename(columns={0: "distance_per_dollar"})
df_distance_per_dollar

Unnamed: 0,request_month,distance_per_dollar
0,1,0.5
1,2,0.466667
2,3,0.52
3,4,0.5075
4,5,0.503571
5,6,0.5
6,7,0.5
7,8,0.5
8,9,0.5
9,10,0.5


In [76]:
df = pd.merge(left=df, right=df_distance_per_dollar, how="left", on="request_month")
df

Unnamed: 0,request_id,request_date,request_status,distance_to_travel,monetary_cost,driver_to_client_distance,request_month,distance_per_dollar
0,1,2023-01-15 08:30:00,completed,12.5,25.0,1.2,1,0.5
1,2,2023-02-01 09:00:00,pending,7.0,15.0,2.5,2,0.466667
2,3,2023-03-10 09:30:00,cancelled,5.2,10.0,0.8,3,0.52
3,4,2023-04-05 10:00:00,completed,20.3,40.0,3.0,4,0.5075
4,5,2023-05-20 10:30:00,pending,14.1,28.0,1.0,5,0.503571
5,6,2023-06-15 11:00:00,completed,10.0,20.0,2.0,6,0.5
6,7,2023-07-01 11:30:00,pending,8.5,17.0,1.5,7,0.5
7,8,2023-08-12 12:00:00,completed,15.0,30.0,2.8,8,0.5
8,9,2023-09-23 12:30:00,cancelled,12.0,24.0,1.1,9,0.5
9,10,2023-10-05 13:00:00,completed,18.0,36.0,1.9,10,0.5


In [80]:
df = df.sort_values(by=['request_month'], ascending=True)

# each month appears in exactly one row here, so shifting by one row gives the previous month's distance_per_dollar

df['prev_distance_per_dollar'] = df['distance_per_dollar'].shift(1)

In [81]:
df

Unnamed: 0,request_id,request_date,request_status,distance_to_travel,monetary_cost,driver_to_client_distance,request_month,distance_per_dollar,prev_distance_per_dollar
0,1,2023-01-15 08:30:00,completed,12.5,25.0,1.2,1,0.5,
1,2,2023-02-01 09:00:00,pending,7.0,15.0,2.5,2,0.466667,0.5
2,3,2023-03-10 09:30:00,cancelled,5.2,10.0,0.8,3,0.52,0.466667
3,4,2023-04-05 10:00:00,completed,20.3,40.0,3.0,4,0.5075,0.52
4,5,2023-05-20 10:30:00,pending,14.1,28.0,1.0,5,0.503571,0.5075
5,6,2023-06-15 11:00:00,completed,10.0,20.0,2.0,6,0.5,0.503571
6,7,2023-07-01 11:30:00,pending,8.5,17.0,1.5,7,0.5,0.5
7,8,2023-08-12 12:00:00,completed,15.0,30.0,2.8,8,0.5,0.5
8,9,2023-09-23 12:30:00,cancelled,12.0,24.0,1.1,9,0.5,0.5
9,10,2023-10-05 13:00:00,completed,18.0,36.0,1.9,10,0.5,0.5
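
The steps above stop at the previous-month value. A minimal end-to-end sketch follows, on hypothetical data where one month has several requests. It assumes that the forecast for a month is simply the previous month's actual value, and that the final metric is a root mean squared error between actual and forecast; that last step is an assumption about the linked question, not something shown in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical requests: two in January, one each in February and March
toy = pd.DataFrame({
    "request_month": [1, 1, 2, 3],
    "distance_to_travel": [10.0, 20.0, 9.0, 8.0],
    "monetary_cost": [20.0, 40.0, 20.0, 10.0],
})

# One row per month: total distance and total cost
monthly = toy.groupby("request_month", as_index=False)[
    ["distance_to_travel", "monetary_cost"]
].sum()
monthly["distance_per_dollar"] = monthly["distance_to_travel"] / monthly["monetary_cost"]

# Naive forecast: this month's prediction is last month's actual value
monthly["forecast"] = monthly["distance_per_dollar"].shift(1)

# Assumed final step: RMSE over the months that have a forecast
errors = (monthly["distance_per_dollar"] - monthly["forecast"]).dropna()
rmse = np.sqrt((errors ** 2).mean())
```

Aggregating to one row per month before shifting avoids the merge back onto the request-level table, and it stays correct when a month contains more than one request.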


### Employees With Same Birth Month

> https://platform.stratascratch.com/coding/10355-employees-with-same-birth-month?code_type=2

---

```python

# Import your libraries
import pandas as pd

# Note: this solution works on sf_transactions and computes the month-over-month
# revenue change, so it answers the "Monthly Percentage Difference" question
# linked further down rather than the birth-month question above.

# Build a sortable yyyy-mm key directly from the timestamp
sf_transactions['year-month'] = sf_transactions['created_at'].dt.strftime('%Y-%m')

# get total revenue by each yyyy-mm
monthly_revenue = (
    sf_transactions.groupby(by="year-month", as_index=False)
    .agg({"value": "sum"})
    .rename(columns={"value": "monthly_revenue"})
    .sort_values(by="year-month")
)
monthly_revenue['prev_month_revenue'] = monthly_revenue['monthly_revenue'].shift(1)

monthly_revenue['revenue_diff_pct'] = (monthly_revenue['monthly_revenue'] - monthly_revenue['prev_month_revenue'])*100/monthly_revenue['prev_month_revenue']

monthly_revenue[['year-month', 'revenue_diff_pct']]

```
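
The shift-and-divide step in the block above can also be written with `pct_change()`. The sketch below checks on hypothetical monthly totals that the two forms agree; the revenue figures are invented.

```python
import pandas as pd

# Hypothetical monthly totals, already sorted by month
monthly_revenue = pd.DataFrame({
    "year-month": ["2019-01", "2019-02", "2019-03"],
    "monthly_revenue": [100.0, 150.0, 120.0],
})

# Manual version: (current - previous) * 100 / previous
previous = monthly_revenue["monthly_revenue"].shift(1)
manual = (monthly_revenue["monthly_revenue"] - previous) * 100 / previous

# Built-in version: fractional change from the previous row, scaled to percent
builtin = monthly_revenue["monthly_revenue"].pct_change() * 100
```

Both versions leave the first month as NaN, since it has no previous month to compare against.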

In [84]:
import random

In [125]:
# Define professions and their possible birth months
profession_birth_months = {
    'Engineer': list(range(1, 13)),
    'Doctor': list(range(1, 13)),
    'Artist': list(range(1, 13)),
}

# Create the dummy data
data = {
    'first_name': ['John', 'Jane', 'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hank'],
    'last_name': ['Doe', 'Smith', 'Johnson', 'Brown', 'Wilson', 'Taylor', 'Anderson', 'Thomas', 'Jackson', 'White'],
    'profession': ['Engineer', 'Engineer', 'Doctor', 'Doctor', 'Artist', 'Artist', 'Engineer', 'Engineer', 'Doctor', 'Artist'],
    'employee_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
}

# Generate random birth months within the specified range for each profession
data['birth_month'] = [random.choice(profession_birth_months[profession]) for profession in data['profession']]

# Generate random birthdays within the birth month and a range of years
data['birthday'] = [datetime(random.randint(1970, 2000), month, random.randint(1, 28)) for month in data['birth_month']]

# Create the DataFrame
df = pd.DataFrame(data)

In [126]:
df

Unnamed: 0,first_name,last_name,profession,employee_id,birth_month,birthday
0,John,Doe,Engineer,1001,2,1987-02-16
1,Jane,Smith,Engineer,1002,12,1993-12-12
2,Alice,Johnson,Doctor,1003,3,1983-03-04
3,Bob,Brown,Doctor,1004,4,1990-04-18
4,Charlie,Wilson,Artist,1005,3,1976-03-17
5,David,Taylor,Artist,1006,6,1992-06-07
6,Eve,Anderson,Engineer,1007,5,1974-05-10
7,Frank,Thomas,Engineer,1008,1,1983-01-26
8,Grace,Jackson,Doctor,1009,8,1971-08-16
9,Hank,White,Artist,1010,6,1984-06-14


In [127]:
all_birth_months = {month_: 0 for month_ in list(range(1,13))}

In [128]:
unique_professions = df['profession'].unique().tolist()

In [129]:
# if we group by each profession and birth_month - the size of that group will be the no of employees (provided no duplicate employees)
df_num_employees_by_month = df.groupby(by=["birth_month", "profession"], as_index=False).size().rename(columns={"size": "num_employees"})

In [130]:
df_num_employees_by_month

Unnamed: 0,birth_month,profession,num_employees
0,1,Engineer,1
1,2,Engineer,1
2,3,Artist,1
3,3,Doctor,1
4,4,Doctor,1
5,5,Engineer,1
6,6,Artist,2
7,8,Doctor,1
8,12,Engineer,1


In [132]:
# run this merge only once: merging again into the already-merged df would produce num_employees_x and num_employees_y
df = pd.merge(left=df, right=df_num_employees_by_month, how='left', on=['birth_month', 'profession'])
df

Unnamed: 0,first_name,last_name,profession,employee_id,birth_month,birthday,num_employees
0,John,Doe,Engineer,1001,2,1987-02-16,1
1,Jane,Smith,Engineer,1002,12,1993-12-12,1
2,Alice,Johnson,Doctor,1003,3,1983-03-04,1
3,Bob,Brown,Doctor,1004,4,1990-04-18,1
4,Charlie,Wilson,Artist,1005,3,1976-03-17,1
5,David,Taylor,Artist,1006,6,1992-06-07,2
6,Eve,Anderson,Engineer,1007,5,1974-05-10,1
7,Frank,Thomas,Engineer,1008,1,1983-01-26,1
8,Grace,Jackson,Doctor,1009,8,1971-08-16,1
9,Hank,White,Artist,1010,6,1984-06-14,2


### Monthly Percentage Difference

> https://platform.stratascratch.com/coding/10319-monthly-percentage-difference?code_type=2

---

#### Question

> https://platform.stratascratch.com/coding/9917-average-salaries?python=1

---

#### Soln:

```python
import pandas as pd

# Average salary per department
avg_dept = employee.groupby(by=['department'], as_index=False).agg({'salary': 'mean'})
avg_dept.rename(columns={'salary': 'avg_salary'}, inplace=True)

# Attach the department average to every employee row and keep the required columns
merged = pd.merge(left=employee, right=avg_dept, how='left', on=['department'])[['department', 'first_name', 'salary', 'avg_salary']]
merged
```
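
A possible alternative to the group-then-merge pattern is `groupby(...).transform(...)`, which returns one value per original row. The sketch below uses a hypothetical stand-in for the platform's `employee` table; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the platform's `employee` table
employee = pd.DataFrame({
    "department": ["Sales", "Sales", "Audit"],
    "first_name": ["Ann", "Ben", "Cal"],
    "salary": [100, 200, 300],
})

# transform('mean') broadcasts each department's mean back to its rows,
# so no separate merge step is needed
employee["avg_salary"] = employee.groupby("department")["salary"].transform("mean")

result = employee[["department", "first_name", "salary", "avg_salary"]]
print(result)
```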

### Calculated Fields


In [None]:
covid_df['Recovery Rate'] = covid_df['Recovered'] / covid_df['Confirmed'] 
covid_df.head()

Unnamed: 0,Date,State,District,Confirmed,Recovered,Deceased,Other,Tested,Recovery Rate
0,2020-04-26,Andaman and Nicobar Islands,Unknown,33,11,0,0,,0.333333
1,2020-04-26,Andhra Pradesh,Anantapur,53,14,4,0,,0.264151
2,2020-04-26,Andhra Pradesh,Chittoor,73,13,0,0,,0.178082
3,2020-04-26,Andhra Pradesh,East Godavari,39,12,0,0,,0.307692
4,2020-04-26,Andhra Pradesh,Guntur,214,29,8,0,,0.135514


### Question

> https://platform.stratascratch.com/coding/10012-advertising-channel-effectiveness?python=1

### Soln:

```python
# Import your libraries
import pandas as pd

uber_advertising = uber_advertising[uber_advertising['year'].isin([2017, 2018])]

# Start writing code
df_by_channel = uber_advertising.groupby(by=['advertising_channel'], as_index=False).agg({'money_spent': 'sum', 'customers_acquired': 'sum'})
df_by_channel['avg_effectiveness'] = df_by_channel['money_spent']/df_by_channel['customers_acquired']

df_by_channel[['advertising_channel', 'avg_effectiveness']].sort_values(by='avg_effectiveness')
```

### Question

> https://platform.stratascratch.com/coding-question?id=2028&python=1

### Soln 

> Official: https://www.stratascratch.com/blog/microsoft-data-scientist-interview-questions/



```python
# Import your libraries
import pandas as pd

## extract only reqd cols
fact_events = fact_events[['time_id', 'user_id']].copy()  # copy so new columns are not added to a view

## create month col
fact_events['month'] = fact_events['time_id'].dt.month

## get first joining month for each user
user_joining_info = fact_events.groupby(by='user_id', as_index=False).agg({'month': 'min'})

user_joining_info.rename(columns={'month': 'first_month'}, inplace=True)

## merge with events df
fact_events_users = pd.merge(left=fact_events, right=user_joining_info, on='user_id', how='inner')[['user_id', 'month', 'first_month']]

## drop duplicates; we do not want multiple records of same month -> user -> first_month combo
fact_events_users.drop_duplicates(inplace=True)

## user is new if month == first_month
fact_events_users['new_user'] = fact_events_users['month'] == fact_events_users['first_month']

## create 1 and 0 cols for new and existing users, so that we can sum up
fact_events_users['new_user_bool'] = fact_events_users['new_user'].astype(int)
fact_events_users['existing_user_bool'] = (~fact_events_users['new_user']).astype(int)

## sum to find new and existing users for each month
fact_events_users = fact_events_users.groupby(by=['month'], as_index=False).agg({'new_user_bool': 'sum', 'existing_user_bool': 'sum'})
fact_events_users
```

### Question

> https://platform.stratascratch.com/coding/10353-workers-with-the-highest-salaries?python=1

---

#### Soln:

```python
import pandas as pd

# Start writing code
max_salary = worker['salary'].max()

worker_max_salary = worker.loc[worker['salary']==max_salary][['first_name', 'last_name', 'worker_id']]

worker_max_title = pd.merge(left=worker_max_salary, right=title, left_on='worker_id', right_on='worker_ref_id', how='inner')[['worker_title']]

worker_max_title
```

### Applying Functions


#### Question

> https://platform.stratascratch.com/coding/9633-city-with-most-amenities?python=1

#### Soln:

```python
airbnb_search_details['num_amenities'] = airbnb_search_details.apply(lambda x: len(x['amenities'].split(',')), axis=1)
airbnb_search_details.groupby(by=['city'],as_index=False).agg({'num_amenities': 'sum'}).sort_values(by='num_amenities', ascending=False).head(1)
```

#### Question

> https://platform.stratascratch.com/coding/9726-classify-business-type?python=1

#### Soln

```python
def classify_business(name, specs):
    name = name.lower()
    for spec, spec_values in specs.items():
        for val in spec_values:
            if val in name:
                return spec
    return 'other'
specs = {
    'restaurant': ['restaurant'],
    'cafe': ['cafe', 'café','coffee'],
    'school': ['school']
}

sf_restaurant_health_violations['classification'] = sf_restaurant_health_violations['business_name'].apply(classify_business, specs=specs)
sf_restaurant_health_violations[['business_name', 'classification']].drop_duplicates()
```

#### Question

> https://platform.stratascratch.com/coding/10145-make-a-pivot-table-to-find-the-highest-payment-in-each-year-for-each-employee?python=1

This problem can be awkward in SQL, but in Pandas it can be solved in a single statement with the `pivot_table()` function. We pass the index, the columns, the values, and the aggregation function, and get the desired output.

```python
pd.pivot_table(
    data=sf_public_salaries,
    index='employeename',
    columns=['year'],
    values='totalpay',
    aggfunc='max',
    fill_value=0,
).reset_index()
```
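
To see what each argument contributes, here is a small sketch on made-up rows. The column names follow the question's table, but the values are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the question's table
sf_public_salaries = pd.DataFrame({
    "employeename": ["A", "A", "A", "B"],
    "year": [2011, 2011, 2012, 2012],
    "totalpay": [100.0, 150.0, 120.0, 90.0],
})

pivot = pd.pivot_table(
    data=sf_public_salaries,
    index="employeename",   # one row per employee
    columns="year",         # one column per year
    values="totalpay",      # the field being aggregated
    aggfunc="max",          # highest payment within each employee/year cell
    fill_value=0,           # employee/year pairs with no record become 0
).reset_index()
print(pivot)
```

Employee `B` has no 2011 record, so that cell is filled with `0` instead of `NaN`.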


#### Question

> https://platform.stratascratch.com/coding/9784-time-between-two-events?python=1


```python
log_merged = pd.merge(left=facebook_web_log, right=facebook_web_log, on=['user_id'], how='inner', suffixes=('_1', '_2'))

log_merged = log_merged.loc[(log_merged['action_1'] == 'page_load') & (log_merged['action_2'] == 'scroll_down')]

log_merged['time_elapsed'] = log_merged['timestamp_2'] - log_merged['timestamp_1']
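# total_seconds() covers the whole duration (and stays negative for negative gaps),
# whereas .seconds returns only the seconds component of the Timedelta
log_merged['time_elapsed_s'] = log_merged['time_elapsed'].dt.total_seconds()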

log_merged.loc[log_merged['time_elapsed_s']>0].sort_values(by=['time_elapsed_s'])
```
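
One detail worth keeping in mind when converting a `Timedelta` to seconds: the `.seconds` attribute holds only the seconds component (days are stored separately), while `.total_seconds()` returns the full duration. A minimal sketch with made-up timestamps:

```python
import pandas as pd

# Two made-up timestamps that are 1 day and 30 seconds apart
gap = pd.Timestamp("2019-04-26 10:00:00") - pd.Timestamp("2019-04-25 09:59:30")

print(gap.seconds)          # seconds component only; the day is not included
print(gap.total_seconds())  # full duration expressed in seconds

# On a Series of timedeltas the same conversion is available through .dt
elapsed = pd.Series([gap])
print(elapsed.dt.total_seconds())
```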



### Datetime ops

#### Question

> https://platform.stratascratch.com/coding/2034-avg-earnings-per-weekday-and-hour?python=1

#### Soln

```python
# Start writing code
df = doordash_delivery[['customer_placed_order_datetime', 'order_total']].copy()  # copy so new columns are not added to a view
df['day'] = df['customer_placed_order_datetime'].dt.day_name()
df['hour'] = df['customer_placed_order_datetime'].dt.hour

df = df.groupby(by=['day', 'hour'], as_index=False).agg({'order_total': 'mean'}).sort_values(by=['day', 'hour'])

df
```


#### Question

> https://platform.stratascratch.com/coding/2052-user-growth-rate?python=1

#### Soln:

```python
sf_events['year'] = sf_events['date'].dt.year
sf_events['month'] = sf_events['date'].dt.month

jan_users = sf_events.loc[(sf_events['month'] == 1) & (sf_events['year'] == 2021)]

jan_users = jan_users.groupby(by=['account_id'], as_index=False).agg({'user_id': 'count'})

dec_users = sf_events.loc[(sf_events['month'] == 12) & (sf_events['year'] == 2020)]
dec_users = dec_users.groupby(by=['account_id'], as_index=False).agg({'user_id': 'count'})

comparison_df = pd.merge(left=dec_users, right=jan_users, how='inner', on='account_id', suffixes=('_dec', '_jan'))
comparison_df['growth_rate'] = comparison_df['user_id_jan']/comparison_df['user_id_dec']

comparison_df
```


#### Question (see how we apply custom function on a rolling window)

> https://platform.stratascratch.com/coding/10319-monthly-percentage-difference?python=1

---

#### Soln

```python
import pandas as pd
import numpy as np

# Start writing code
sf_transactions.head()

### get year and month
sf_transactions['year'] = sf_transactions['created_at'].dt.year
sf_transactions['month'] = sf_transactions['created_at'].dt.month

### year-month wise sum of value
year_month_wise = sf_transactions.groupby(by=['year', 'month'], as_index=False).agg({'value': 'sum'})
year_month_wise = year_month_wise.sort_values(by=['year', 'month'], ascending=[True, True])

def perc_change(x):
    if len(x) < 2:
        return np.nan
    else:
        return np.round((x[1] - x[0])*100/x[0],2)
        
def zero_pad(x):
    # format the month number as two digits, e.g. 3 -> '03'
    return f'{x:02d}'

### calculate % change on a rolling window
### raw=True passes each window as a NumPy array, so x[0] and x[1] are positional
year_month_wise['perc_change'] = year_month_wise['value'].rolling(window=2, min_periods=1).apply(perc_change, raw=True)

### pad month with a leading 0
year_month_wise['month_zero_padded'] = year_month_wise['month'].apply(zero_pad)

### concat yyyy-mm
year_month_wise['year_month_str'] = year_month_wise['year'].astype(str) + '-' + year_month_wise['month_zero_padded'].astype(str)

year_month_wise[['year_month_str', 'perc_change']]

```
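
As a sanity check on the rolling-window approach, the sketch below applies a custom function over windows of two values and compares the result with the built-in `Series.pct_change()`. The `monthly` values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up monthly totals
monthly = pd.Series([100.0, 150.0, 120.0])

def perc_change(window):
    # With raw=True the window arrives as a NumPy array, so indexing is positional
    if len(window) < 2:
        return np.nan
    return (window[1] - window[0]) * 100 / window[0]

# Custom function on a rolling window of size 2
rolled = monthly.rolling(window=2, min_periods=1).apply(perc_change, raw=True).round(2)

# Built-in equivalent: fractional change from the previous row, scaled to percent
builtin = (monthly.pct_change() * 100).round(2)

print(rolled)
print(builtin)
```

For a plain month-over-month change `pct_change()` is shorter; the rolling `apply` is useful when the window logic is something pandas does not provide directly.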




### Text manipulation

As with datetime functions, Pandas provides a range of string functions. Just as the `.dt` accessor exposes datetime functions, the `.str` accessor applies the standard string methods across an entire Series, and it adds some functions beyond the standard string library that can come in handy. Let us look at a few examples of Python Pandas interview questions. The first one is from a City of Los Angeles Data Science interview.

#### Question

> https://platform.stratascratch.com/coding/9697-bakery-owned-facilities?python=1

#### Soln

While there are a lot of columns in the dataset, the relevant ones are owner_name and pe_description. We start off by keeping only the relevant columns in the dataset and dropping duplicates (if any).

We then search for the text BAKERY in the owner_name field and LOW RISK in the pe_description field. To do this, we use the `.str.lower()` method to convert all the values to lowercase and the `.str.find()` method to locate the relevant text. `.str.find()` is the Series counterpart of Python's built-in string method `find()`: it returns -1 where the text is not found.

__Note how we have to use .str again on top of .lower to use .find__

```python

# Keep relevant fields
rel_df = los_angeles_restaurant_health_inspections[['owner_name', 'pe_description']].drop_duplicates()
rel_df

rel_df[(rel_df['pe_description'].str.lower().str.find('low risk') != -1) & (rel_df['owner_name'].str.lower().str.find('bakery') != -1)]

```
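
An alternative sketch for the same filter uses `.str.contains()` with `case=False`, which skips the explicit lowercasing and the comparison with -1. The rows below are made up, and `na=False` is an added assumption that missing names should count as non-matches.

```python
import pandas as pd

# Made-up rows; only the column names follow the question's table
rel_df = pd.DataFrame({
    "owner_name": ["SUNRISE BAKERY INC", "JOE'S DINER", "Corner Bakery"],
    "pe_description": [
        "RESTAURANT (0-30) SEATS LOW RISK",
        "RESTAURANT LOW RISK",
        "RESTAURANT HIGH RISK",
    ],
})

# case=False makes the match case-insensitive; na=False treats missing values as no match
mask = (
    rel_df["owner_name"].str.contains("bakery", case=False, na=False)
    & rel_df["pe_description"].str.contains("low risk", case=False, na=False)
)
bakeries = rel_df[mask]
print(bakeries)
```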



### Question

> https://platform.stratascratch.com/coding/10303-top-percentile-fraud?python=

#### Soln

```python
import numpy as np
import pandas as pd

def get_top_perc_fraud(fraud_scores):
    return np.quantile(fraud_scores, 0.95)

### group by state and get the top 5% value of each state
top_scores_per_state = fraud_score.groupby(by='state', as_index=False).agg({'fraud_score': get_top_perc_fraud})

### merge to get the reference top 5% score for each state
merged_df = pd.merge(left=fraud_score, right=top_scores_per_state, how='left', on='state', suffixes=('', '_ref'))

### filter
merged_df = merged_df.loc[merged_df['fraud_score'] >= merged_df['fraud_score_ref']]

merged_df

```
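
The per-state threshold can also be attached to each row with `groupby(...).transform(...)`, which avoids the merge. This is a sketch on invented scores, not the platform's data:

```python
import pandas as pd

# Made-up scores; only the column names follow the question's table
fraud_score = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "NY", "NY"],
    "fraud_score": [0.1, 0.2, 0.3, 0.9, 0.5, 0.7],
})

# transform returns the state's 95th percentile aligned to every original row
threshold = fraud_score.groupby("state")["fraud_score"].transform(lambda s: s.quantile(0.95))

# keep only the rows at or above their own state's threshold
top = fraud_score[fraud_score["fraud_score"] >= threshold]
print(top)
```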

### Question

> https://platform.stratascratch.com/coding/10302-distance-per-dollar?python=

---

#### Soln

```python
import pandas as pd
import numpy as np
uber_request_logs.head()

### dist_per_dollar = distance_to_travel/monetary_cost
uber_request_logs['dist_per_dollar'] = uber_request_logs['distance_to_travel']/uber_request_logs['monetary_cost']

### filter reqd cols
uber_request_logs = uber_request_logs[['request_date', 'distance_to_travel', 'monetary_cost', 'dist_per_dollar']].copy()

### convert request date to str
uber_request_logs['request_date'] = uber_request_logs['request_date'].astype('str')

### create year-month col
uber_request_logs['year-month'] = uber_request_logs['request_date'].str.strip().str.slice(0, 7)

### Get avg dist_per_dollar for each year-month
uber_request_logs_y_m = uber_request_logs.groupby(by='year-month', as_index=False).agg({'dist_per_dollar': 'mean'})

### merge with main df
merged_df = pd.merge(left=uber_request_logs, right=uber_request_logs_y_m, how='left', suffixes=('', '_monthly'), on='year-month')[['year-month', 'dist_per_dollar', 'dist_per_dollar_monthly']]

### compute diff
merged_df['dist_per_dollar_diff'] = np.abs(merged_df['dist_per_dollar'] - merged_df['dist_per_dollar_monthly'])

### return final df grouped by year-month and avg(abs diff in dist per dollar) 
merged_df = merged_df.groupby(by='year-month', as_index=False).agg({'dist_per_dollar_diff': 'mean'}).sort_values(by='year-month')
```

### Question

> https://platform.stratascratch.com/coding/10285-acceptance-rate-by-date?python=
---

#### Soln:

```python
import pandas as pd
import numpy as np

### join with same df to get all combos
all_combos = pd.merge(left=fb_friend_requests, right=fb_friend_requests, how='inner', on=['user_id_sender', 'user_id_receiver'])

### get the accepted requests
all_combos_accepted = all_combos.loc[(all_combos['action_x']=='sent') & (all_combos['action_y']=='accepted')]

### get the sent requests
all_combos_sent = all_combos.loc[(all_combos['action_x']=='sent') & (all_combos['action_y']=='sent')]

### count all sent requests by date
sent_by_date = all_combos_sent.groupby(by='date_x', as_index=False).agg({'action_x':'count'})

### count all accepted requests by date
accepted_by_date = all_combos_accepted.groupby(by='date_x', as_index=False).agg({'action_x':'count'})

### merge and calculate acceptance_rate
final_merged = pd.merge(left=sent_by_date, right=accepted_by_date, how='inner', on='date_x')

final_merged['acceptance_rate'] = final_merged['action_x_y']/final_merged['action_x_x']

final_merged[['date_x', 'acceptance_rate']]
```

### Question

> https://platform.stratascratch.com/coding/10300-premium-vs-freemium?python=

---

#### Soln:

```python
import pandas as pd

### merge all dfs
df = pd.merge(left=ms_download_facts, right=ms_user_dimension, how='inner', on='user_id')
df = pd.merge(left=df, right=ms_acc_dimension, how='inner', on='acc_id')

### get SUM downloads by date for non-paying customers
non_paying = df.loc[df['paying_customer'] == 'no'].groupby(by='date',as_index=False).agg({'downloads':'sum'}).sort_values(by='date')

### get SUM downloads by date for paying customers
paying = df.loc[df['paying_customer'] == 'yes'].groupby(by='date',as_index=False).agg({'downloads':'sum'}).sort_values(by='date')


### merge and filter
results_df = pd.merge(left=paying, right=non_paying, how='inner', on='date')

results_df.loc[results_df['downloads_y']>results_df['downloads_x']].sort_values(by='date')
```

### Question

> https://www.youtube.com/watch?v=QenwDm5oWdU

---

#### Soln:

```python
import pandas as pd
import numpy as np
import datetime

### create new YYYY-MM col by converting date to string
sf_transactions['year_month'] = [date_.strftime('%Y-%m') for date_ in sf_transactions['created_at']]

### sort by YYYY-MM
sf_transactions = sf_transactions.sort_values(by='year_month')

### year-month revenue sum
revenue_sum = sf_transactions.groupby(by='year_month', as_index=False).agg({'value': 'sum'})

def get_prev(x):
    return x.values[0]

### rolling window to get prev month's revenue
revenue_sum['prev'] = revenue_sum['value'].rolling(window=2).apply(get_prev)

### compute % change and round it to 2 decimal places
revenue_sum['perc_change'] = np.round((revenue_sum['value'] - revenue_sum['prev'])*100/revenue_sum['prev'],2)

revenue_sum[['year_month', 'perc_change']]
```
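
The previous month's revenue can also be obtained with `Series.shift(1)` instead of a rolling window. A minimal sketch on made-up monthly totals, assuming the rows are already sorted by `year_month`:

```python
import pandas as pd

# Made-up monthly revenue, already sorted by year_month
revenue_sum = pd.DataFrame({
    "year_month": ["2019-01", "2019-02", "2019-03"],
    "value": [200.0, 100.0, 150.0],
})

# shift(1) moves every value down one row, so each row sees the previous month's value
revenue_sum["prev"] = revenue_sum["value"].shift(1)

# percentage change relative to the previous month, rounded to 2 decimal places
revenue_sum["perc_change"] = (
    (revenue_sum["value"] - revenue_sum["prev"]) * 100 / revenue_sum["prev"]
).round(2)
print(revenue_sum[["year_month", "perc_change"]])
```

The first month has no predecessor, so its `perc_change` is `NaN`.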
