### Python DataWrangling Pandas Combining

_Combine DataFrames with .concat()_

In this lesson, we'll return to the video game sales dataset. Here are the first few rows to remind you of its structure:

In [1]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')
print(df.head())

                       name platform  year_of_release         genre publisher  \
0                Wii Sports      Wii           2006.0        Sports  Nintendo   
1         Super Mario Bros.      NES           1985.0      Platform  Nintendo   
2            Mario Kart Wii      Wii           2008.0        Racing  Nintendo   
3         Wii Sports Resort      Wii           2009.0        Sports  Nintendo   
4  Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing  Nintendo   

  developer  na_sales  eu_sales  jp_sales  critic_score user_score  
0  Nintendo     41.36     28.96      3.77          76.0          8  
1       NaN     29.08      3.58      6.81           NaN        NaN  
2  Nintendo     15.68     12.76      3.79          82.0        8.3  
3  Nintendo     15.61     10.93      3.28          80.0          8  
4       NaN     11.27      8.89     10.22           NaN        NaN  


We want to know some general statistics about game publishers, specifically:

- their average review score;
- their total sales.

As we've already seen, we can do this using groupby(). First, let's get the average review score for each publisher:

In [2]:
mean_score = df.groupby('publisher')['critic_score'].mean()
print(mean_score)

publisher
10TACLE Studios                 42.000000
1C Company                      73.000000
20th Century Fox Video Games          NaN
2D Boy                          90.000000
3DO                             57.470588
                                  ...    
id Software                     85.000000
imageepoch Inc.                       NaN
inXile Entertainment            81.000000
mixi, Inc                             NaN
responDESIGN                          NaN
Name: critic_score, Length: 581, dtype: float64


Let's also get the total sales for each publisher. The easiest way to do this is to add a 'total_sales' column that sums the regional sales, then use a second groupby():

In [3]:
df['total_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales']
num_sales = df.groupby('publisher')['total_sales'].sum()
print(num_sales)

publisher
10TACLE Studios                 0.11
1C Company                      0.08
20th Century Fox Video Games    1.92
2D Boy                          0.03
3DO                             9.52
                                ... 
id Software                     0.02
imageepoch Inc.                 0.04
inXile Entertainment            0.09
mixi, Inc                       0.87
responDESIGN                    0.13
Name: total_sales, Length: 581, dtype: float64


Notice that the index for both results is the 'publisher' column _because we grouped by 'publisher' in both cases_. __Since both results share the same index, we can easily join the results__ into a DataFrame using pandas' concat() function:

In [4]:
df_concat = pd.concat([mean_score, num_sales], axis='columns')
print(df_concat)

                              critic_score  total_sales
publisher                                              
10TACLE Studios                  42.000000         0.11
1C Company                       73.000000         0.08
20th Century Fox Video Games           NaN         1.92
2D Boy                           90.000000         0.03
3DO                              57.470588         9.52
...                                    ...          ...
id Software                      85.000000         0.02
imageepoch Inc.                        NaN         0.04
inXile Entertainment             81.000000         0.09
mixi, Inc                              NaN         0.87
responDESIGN                           NaN         0.13

[581 rows x 2 columns]


In general, concat() expects a list of Series and/or DataFrame objects. To get our result, we passed a list of Series variables to concat() and set axis='columns' to ensure they were combined as columns.

Note that the original column names are preserved in the concatenated DataFrame.
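To see how concat() lines up rows by index label, here is a minimal sketch on made-up data. The publisher labels 'A', 'B' and 'C' and all the values are hypothetical and are not taken from the video game dataset. The Series names become the column names, and any label missing from one Series is filled with NaN:

```python
import pandas as pd

# Hypothetical toy data: two Series indexed by publisher label.
avg_score = pd.Series({'A': 80.0, 'B': 65.0}, name='critic_score')
sales = pd.Series({'A': 1.5, 'B': 0.4, 'C': 2.0}, name='total_sales')

# Combine as columns; rows are matched on the index labels.
combined = pd.concat([avg_score, sales], axis='columns')
print(combined)

# 'C' has no entry in avg_score, so its critic_score is NaN.
print(combined.loc['C'])
```

This is the same behavior we saw above for publishers that have sales but no critic scores.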

We can rename columns using the columns attribute. We assign it a list of new column names, which replaces the existing ones. The list must contain one name per column, in the same order as the original column names.

Let's rename 'critic_score', as it now represents an average:

In [5]:
df_concat.columns = ['avg_critic_score', 'total_sales']
print(df_concat)

                              avg_critic_score  total_sales
publisher                                                  
10TACLE Studios                      42.000000         0.11
1C Company                           73.000000         0.08
20th Century Fox Video Games               NaN         1.92
2D Boy                               90.000000         0.03
3DO                                  57.470588         9.52
...                                        ...          ...
id Software                          85.000000         0.02
imageepoch Inc.                            NaN         0.04
inXile Entertainment                 81.000000         0.09
mixi, Inc                                  NaN         0.87
responDESIGN                               NaN         0.13

[581 rows x 2 columns]
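Assigning to columns replaces every name at once. If only one or two names need to change, the DataFrame rename() method with a dictionary is a possible alternative. The sketch below uses a small hand-built DataFrame (the variable names summary and renamed are chosen for illustration) rather than the full dataset:

```python
import pandas as pd

# A small stand-in for the concatenated result.
summary = pd.DataFrame(
    {'critic_score': [42.0, 73.0], 'total_sales': [0.11, 0.08]},
    index=pd.Index(['10TACLE Studios', '1C Company'], name='publisher'),
)

# Map old name -> new name; columns not mentioned keep their names.
renamed = summary.rename(columns={'critic_score': 'avg_critic_score'})
print(renamed.columns.tolist())

# rename() returns a new DataFrame, so the original is unchanged.
print(summary.columns.tolist())
```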


In general, it's a good idea to rename columns after grouping and aggregating so that each name reflects how its values were computed.
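One way to get descriptive names without a separate renaming step is named aggregation in agg(), where each keyword becomes an output column. This is a sketch on hypothetical toy data (the games DataFrame and its values are invented for illustration), not the lesson's dataset:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the lesson.
games = pd.DataFrame({
    'publisher': ['A', 'A', 'B'],
    'critic_score': [80.0, 90.0, 70.0],
    'total_sales': [1.0, 2.0, 0.5],
})

# Each keyword is the new column name; the tuple is (source column, function).
summary = games.groupby('publisher').agg(
    avg_critic_score=('critic_score', 'mean'),
    total_sales=('total_sales', 'sum'),
)
print(summary)
```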

You may have noticed that we could get the same result as before using agg(). However, concat() is quite versatile. We can use it to concatenate DataFrames:

- by rows, which works best when they have the same columns;
- by columns, which works best when they have the same row index.

In both cases, concat() aligns on labels and fills anything missing with NaN.

To concatenate rows from separate DataFrames, we can use concat() and set axis='index' (or omit this parameter, as axis='index' is the default). Alternatively, we can pass integers to the axis= argument, where axis=0 concatenates rows and axis=1 concatenates columns.
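The two directions can be sketched side by side on small made-up DataFrames. The names top, bottom and extra are hypothetical, and the integer and string forms of axis are interchangeable:

```python
import pandas as pd

# Hypothetical toy frames.
top = pd.DataFrame({'name': ['a', 'b'], 'sales': [1.0, 2.0]})
bottom = pd.DataFrame({'name': ['c'], 'sales': [3.0]})
extra = pd.DataFrame({'score': [70.0, 80.0]})

# Stack rows: axis=0 is the same as axis='index' (the default).
by_rows = pd.concat([top, bottom], axis=0)
print(by_rows.shape)

# Place side by side: axis=1 is the same as axis='columns'.
by_cols = pd.concat([top, extra], axis=1)
print(by_cols.shape)
```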

Here's an example where we filter the data into two separate DataFrames based on genre and then recombine them into a single DataFrame:

In [6]:
rpgs = df[df['genre'] == 'Role-Playing']
platformers = df[df['genre'] == 'Platform']

df_concat = pd.concat([rpgs, platformers])
print(df_concat[['name', 'genre']])

                                                   name         genre
4                              Pokemon Red/Pokemon Blue  Role-Playing
12                          Pokemon Gold/Pokemon Silver  Role-Playing
20                        Pokemon Diamond/Pokemon Pearl  Role-Playing
25                        Pokemon Ruby/Pokemon Sapphire  Role-Playing
27                          Pokemon Black/Pokemon White  Role-Playing
...                                                 ...           ...
16356                                    Strider (2014)      Platform
16358                                Goku Makaimura Kai      Platform
16603  The Land Before Time: Into the Mysterious Beyond      Platform
16710                Woody Woodpecker in Crazy Castle 5      Platform
16715                                  Spirits & Spells      Platform

[2388 rows x 2 columns]


And so the two DataFrames are combined into one! Remember, __this works here because both smaller DataFrames have the same columns__.
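To illustrate why matching columns matter, the sketch below stacks two hypothetical one-row DataFrames where only the second has a 'year' column. The game names and values are invented. It also passes ignore_index=True, an optional concat() parameter that discards the original row labels and renumbers the result from 0:

```python
import pandas as pd

# Hypothetical toy frames; only the second one has a 'year' column.
rpgs = pd.DataFrame({'name': ['Game X'], 'genre': ['Role-Playing']}, index=[4])
platformers = pd.DataFrame(
    {'name': ['Game Y'], 'genre': ['Platform'], 'year': [1985]},
    index=[1],
)

# Columns are matched by name; the missing 'year' value becomes NaN.
stacked = pd.concat([rpgs, platformers], ignore_index=True)
print(stacked)
```

Without ignore_index=True, the result keeps the original row labels (4 and 1 here), as in the genre example above.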