In [2]:
import numpy as np
import pandas as pd

## **Renaming columns, dropping columns, and creating new columns from existing columns**

**Renaming columns using rename():**

The rename() function changes the names of one or more columns in a DataFrame. Pass a dictionary (or other mapping) from the old column names to the new column names through the columns parameter.

**Removing columns using drop():**

The drop() function removes rows or columns from a DataFrame. To remove columns, pass the column labels together with axis=1 (or use the columns parameter). By default it returns a new DataFrame without the specified labels; setting the inplace parameter to True modifies the DataFrame in place instead.
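As a minimal sketch of both calls, the example below uses a made-up DataFrame named `demo` (not one of the notebook's datasets). It illustrates that rename() and drop() return new DataFrames by default and leave the original untouched.

```python
import pandas as pd

# Hypothetical example data, used only for illustration
demo = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# rename() returns a new DataFrame; only the listed columns are renamed
renamed = demo.rename(columns={"A": "alpha"})

# drop() with columns= removes columns; index= removes rows instead
without_b = demo.drop(columns=["B"])
without_row0 = demo.drop(index=[0])

print(list(renamed.columns))    # ['alpha', 'B', 'C']
print(list(without_b.columns))  # ['A', 'C']
print(list(demo.columns))       # ['A', 'B', 'C'] (original unchanged)
```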

### *Renaming column names*

In [9]:
# Create a DataFrame
df_c = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

print(df_c)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


In [4]:
df_c.rename(columns = {"A":"alpha", "B": "bravo", "C":"charlie"}, inplace = True)

print(df_c)

   alpha  bravo  charlie
0      1      4        7
1      2      5        8
2      3      6        9


### *Dropping rows and columns*

In [5]:
# Remove a single column
df_c = df_c.drop('alpha', axis=1)
print(df_c)

   bravo  charlie
0      4        7
1      5        8
2      6        9


In [12]:
# Remove multiple rows by label (axis=0); to remove multiple columns, pass column names with axis=1
df_c = df_c.drop([0, 2], axis=0)

print(df_c)

   bravo  charlie
1      5        8

### *Creating new columns from the existing columns*

In [13]:
# Create a DataFrame
df_c = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

print(df_c)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


In [14]:
df_c["D"] = df_c["B"]/df_c["C"]

print(df_c)

   A  B  C         D
0  1  4  7  0.571429
1  2  5  8  0.625000
2  3  6  9  0.666667


In [15]:
df_c["E"] = df_c["A"] * df_c["B"]

In [16]:
df_c

   A  B  C         D   E
0  1  4  7  0.571429   4
1  2  5  8  0.625000  10
2  3  6  9  0.666667  18


## **apply, map and applymap functions**

apply, map, and applymap are methods for applying functions to the data in a **DataFrame or Series**.

**apply():** This method works on both DataFrames and Series. On a DataFrame, the function is applied along an axis and receives each column (axis=0) or each row (axis=1) as a Series; on a Series, it is applied to each element. apply() accepts custom functions as well as built-in ones.

**map():** This method is primarily used with Series objects and is used to substitute values based on a provided mapping or a function. It takes a Series and applies a transformation based on the specified mapping or function. It's commonly used for data manipulation and creating derived columns.

**applymap():** This method is specific to DataFrames and applies a function element-wise, that is, to each individual cell of the DataFrame. In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(), which does the same thing.
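As a small sketch of Series.map(), the example below uses a hypothetical `grades` Series (invented for illustration) to show substitution with a dictionary and with a function. Values that are missing from the dictionary become NaN.

```python
import pandas as pd

# Hypothetical letter grades, used only for illustration
grades = pd.Series(["A", "B", "C", "B"])

# map() with a dictionary substitutes each value by its mapped value
points = grades.map({"A": 4, "B": 3, "C": 2})

# map() with a function transforms each element
lowered = grades.map(str.lower)

# Keys missing from the dictionary produce NaN
partial = grades.map({"A": 4})

print(points.tolist())   # [4, 3, 2, 3]
print(lowered.tolist())  # ['a', 'b', 'c', 'b']
print(partial.isna().sum())  # 3
```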

In [17]:
Student_scores = {"Maths" : [55, 85, 95, 99],
                  "Comp_sci": [98, 85, 65, 87],
                  "Physics" : [85, 95, 90, 75]}

scores = pd.DataFrame(Student_scores, index = ["Student1", "Student2", "Student3", "Student4" ])
print(scores)

          Maths  Comp_sci  Physics
Student1     55        98       85
Student2     85        85       95
Student3     95        65       90
Student4     99        87       75


In [19]:
scores.apply(lambda x: x.sum(), axis = 0) ## axis = 0 applies the function to each column; axis = 1 applies it to each row

Maths       334
Comp_sci    335
Physics     345
dtype: int64

In [20]:
scores["Maths"].apply(lambda x: x+2)

Student1     57
Student2     87
Student3     97
Student4    101
Name: Maths, dtype: int64

In [25]:
## using the apply method on a Series
s = pd.Series([1, 2, 3, 4, 5])

# Apply a function that takes the square root of each element of the Series
s_sqrt = s.apply(lambda x: x**0.50)

print(s_sqrt)

0    1.000000
1    1.414214
2    1.732051
3    2.000000
4    2.236068
dtype: float64


In [28]:
## applymap applies a function to every cell of the DataFrame

percentage_score = scores.applymap(lambda x: x /100)

print(percentage_score)


          Maths  Comp_sci  Physics
Student1   0.55      0.98     0.85
Student2   0.85      0.85     0.95
Student3   0.95      0.65     0.90
Student4   0.99      0.87     0.75


# **Pandas groupby function**

<img src = "groupby_graph.png" style = "width:880px;height:250px">

**Reference: McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc.**



The groupby function in Pandas is a powerful tool for **splitting a DataFrame** into groups based on one or more criteria (variables), **performing operations** on each group, and combining the results into a new DataFrame. It allows for **data aggregation**, transformation, and analysis on subsets of data defined by unique values in one or more columns.

Once you have created a groupby object, you can apply various aggregation or transformation functions on it, such as sum(), mean(), count(), apply(), etc., to calculate statistics or manipulate the grouped data.

**What type of variables are generally used for grouping data?**

We generally use categorical variables to group the data. For example, continent, country, and year can be treated as categorical variables because they take a limited number of fixed values. Grouping by a continuous variable is rarely useful, because almost every row would end up in its own group.

A groupby object only yields a meaningful analysis when groupby is *followed* by some function, usually an aggregation function.

In [29]:
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 15, 12, 8, 20]
}
df = pd.DataFrame(data)

print(df)


      Name Category  Value
0    Alice        A     10
1      Bob        B     15
2  Charlie        A     12
3    Alice        B      8
4      Bob        A     20


In [41]:
# Group by 'Name'
grouped = df.groupby('Name') ## Here we are grouping by a single column

# Calculate the minimum of 'Value' within each group
min_by_name = grouped['Value'].min()


In [40]:
print(min_by_name)

Name
Alice       8
Bob        15
Charlie    12
Name: Value, dtype: int64


In [42]:
list(grouped)

[('Alice',
      Name Category  Value
  0  Alice        A     10
  3  Alice        B      8),
 ('Bob',
    Name Category  Value
  1  Bob        B     15
  4  Bob        A     20),
 ('Charlie',
        Name Category  Value
  2  Charlie        A     12)]

In [None]:
print(type(grouped))
print(type(grouped["Value"]))

In [None]:
## What do we mean by splitting the DataFrame on the basis of a group and then combining?

df.loc[df["Category"] == "B", "Value"].sum()

In [None]:
df.loc[df["Category"] == "A", "Value"].sum()
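The split, apply, and combine steps can also be written out by hand. The sketch below rebuilds the 'Category' and 'Value' columns of the example DataFrame under the assumed name `df_demo` and compares a manual loop over the groups with a single groupby call.

```python
import pandas as pd

# Same 'Category' and 'Value' data as the example above; the name df_demo is illustrative
df_demo = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A"],
    "Value": [10, 15, 12, 8, 20],
})

# Split: one sub-DataFrame per category; apply: sum each; combine: collect into a Series
manual = pd.Series(
    {key: group["Value"].sum() for key, group in df_demo.groupby("Category")}
)

# The same result in one step
direct = df_demo.groupby("Category")["Value"].sum()

print(int(manual["A"]), int(manual["B"]))  # 42 23
print(int(direct["A"]), int(direct["B"]))  # 42 23
```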

The following example demonstrates grouping by multiple columns with the groupby function in Pandas.

In [None]:
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value1': [10, 15, 12, 8, 20],
    'Value2': [5, 3, 7, 2, 10]
}
df1 = pd.DataFrame(data)


In [None]:
# Group by 'Name' and 'Category'
grouped = df1.groupby(by = ['Name', 'Category']) ## We can pass the names of multiple columns as strings in a list.

# Calculate the sum of 'Value1' and 'Value2' within each group
sum_by_group = grouped[["Value1", "Value2"]].sum()

print(sum_by_group)


## Aggregation functions that we can use after groupby

When using the groupby function in Pandas, you can apply various aggregation functions to calculate summary statistics or perform computations on the grouped data. Some common aggregation functions include:


**sum():** Calculates the sum of values within each group.

**mean():** Computes the mean (average) of values within each group.

**median():** Calculates the median of values within each group.

**min():** Finds the minimum value within each group.

**max():** Finds the maximum value within each group.

**count():** Counts the number of non-null values within each group.

**size():** Counts the total number of rows within each group, including null values.

**std():** Computes the standard deviation of values within each group.

**var():** Calculates the variance of values within each group.

**first():** Returns the first value within each group.

**last():** Returns the last value within each group.

**nth():** Returns the nth value within each group.

**apply():** Applies a custom function to each group.
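Several of these aggregations can be computed in one call with agg(). The sketch below uses a hypothetical DataFrame with one missing value (invented for illustration) so that the difference between count() and size() is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value to contrast count() and size()
demo = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "B"],
    "Value": [10.0, np.nan, 5.0, 7.0, 9.0],
})
grouped_demo = demo.groupby("Category")["Value"]

# agg() computes several aggregations at once, one column per function
summary = grouped_demo.agg(["sum", "mean", "min", "max", "count"])

# size() counts rows including the missing value; count() above skips it
sizes = grouped_demo.size()

print(summary)
print(sizes)
```

For group A, count is 1 while size is 2, because one of its two rows holds a missing value.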

In [None]:
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 15, 12, 8, 20]
}
df = pd.DataFrame(data)

# Group by 'Category'
grouped = df.groupby('Category')


In [None]:
# Let us apply aggregation functions
sum_by_category = grouped['Value'].sum()
mean_by_category = grouped['Value'].mean()
max_by_category = grouped['Value'].max()
count_by_category = grouped['Value'].count()


In [None]:
print("Sum:")
print(sum_by_category)


print("Mean:")
print(mean_by_category)


print("Max:")
print(max_by_category)


print("Count:")
print(count_by_category)

In [None]:
## Data analysis questions

In [43]:
gapminder = pd.read_csv("gap_minder.csv")

In [47]:
gapminder.groupby(by = "year")["pop"].sum()

year
1952    2406957150
1957    2664404580
1962    2899782974
1967    3217478384
1972    3576977158
1977    3930045807
1982    4289436840
1987    4691477418
1992    5110710260
1997    5515204472
2002    5886977579
2007    6251013179
Name: pop, dtype: int64

## **Simple groupby applications**

- Total number of countries in each continent in 2007.
- The most and least populous country in each continent in 2007.
- The richest and poorest country in each continent in 1952.

### *Total number of countries in each continent in 2007.*

In [50]:
gapminder[ (gapminder["year"]== 2007) ].groupby(by = "continent")["country"].count()

continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64

### *The most and least populous country in each continent in 2007.*

In [51]:
most_populous = gapminder[ (gapminder["year"]== 2007) ].groupby(by = ["continent"])["pop"].max()

In [52]:
most_populous

continent
Africa       135031164
Americas     301139947
Asia        1318683096
Europe        82400996
Oceania       20434176
Name: pop, dtype: int64

In [53]:
# Use boolean indexing to filter rows
gapminder[gapminder["pop"].isin(most_populous)]

            country  continent  year  lifeExp         pop    gdpPercap
71        Australia    Oceania  2007   81.235    20434176  34435.36744
299           China       Asia  2007   72.961  1318683096  4959.114854
575         Germany     Europe  2007   79.406    82400996  32170.37442
1139        Nigeria     Africa  2007   46.859   135031164  2013.977305
1619  United States   Americas  2007   78.242   301139947  42951.65309


In [56]:
least_populous = gapminder[ (gapminder["year"]== 2007) ].groupby(by = ["continent"])["pop"].min()

In [57]:
gapminder[gapminder["pop"].isin(least_populous)]

                    country  continent  year  lifeExp      pop    gdpPercap
95                  Bahrain       Asia  2007   75.635   708573  29796.04834
695                 Iceland     Europe  2007   81.757   301931  36180.78919
1103            New Zealand    Oceania  2007   80.204  4115771  25185.00911
1307  Sao Tome and Principe     Africa  2007   65.528   199579  1598.435089
1559    Trinidad and Tobago   Americas  2007   69.819  1056608  18008.50924
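The isin() filter used above matches any row, in any year or continent, whose population happens to equal one of the extreme values. A possibly more robust sketch is to look up the row labels directly with idxmax() and idxmin(). The small `toy` DataFrame below is invented for illustration and only mimics the gapminder column names.

```python
import pandas as pd

# Hypothetical stand-in for the gapminder data
toy = pd.DataFrame({
    "country": ["X", "Y", "Z", "W"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "year": [2007, 2007, 2007, 2007],
    "pop": [500, 900, 300, 100],
})

toy_2007 = toy[toy["year"] == 2007]

# idxmax()/idxmin() return the row label of the extreme value in each group
most_idx = toy_2007.groupby("continent")["pop"].idxmax()
least_idx = toy_2007.groupby("continent")["pop"].idxmin()

# Select exactly those rows by label
print(toy_2007.loc[most_idx, ["continent", "country", "pop"]])
print(toy_2007.loc[least_idx, ["continent", "country", "pop"]])
```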


### *The richest and poorest country in each continent in 1952.*

In [58]:
richest_country = gapminder[ (gapminder["year"]== 1952) ].groupby(by = ["continent"])["gdpPercap"].max()

In [60]:
gapminder[gapminder["gdpPercap"].isin(richest_country)]

            country  continent  year  lifeExp        pop    gdpPercap
852          Kuwait       Asia  1952   55.565     160000  108382.3529
1092    New Zealand    Oceania  1952    69.39    1994794  10556.57566
1404   South Africa     Africa  1952   45.009   14264935  4725.295531
1476    Switzerland     Europe  1952    69.62    4815000  14734.23275
1608  United States   Americas  1952    68.44  157553000  13990.48208


In [61]:
poorest_country = gapminder[ (gapminder["year"]== 1952) ].groupby(by = ["continent"])["gdpPercap"].min()

In [62]:
gapminder[gapminder["gdpPercap"].isin(poorest_country)]

                     country  continent  year  lifeExp       pop    gdpPercap
60                 Australia    Oceania  1952    69.12   8691212  10039.59564
144   Bosnia and Herzegovina     Europe  1952    53.82   2791000   973.533195
432       Dominican Republic   Americas  1952   45.928   2491346  1397.717137
876                  Lesotho     Africa  1952   42.138    748747   298.846212
1044                 Myanmar       Asia  1952   36.319  20092996        331.0


## **Total population increase between 1952 and 2007**

What should be the grouping variable(s) for answering this question?

In [68]:
gapminder[(gapminder["year"] == 1952) | (gapminder["year"] == 2007)].groupby(by = "year")["pop"].sum()

year
1952    2406957150
2007    6251013179
Name: pop, dtype: int64

In [69]:
6251013179 - 2406957150

3844056029

## **Which continent has the largest population increase between 1952 and 2007?**

In [None]:
Pop_increase_by_continent = gapminder[(gapminder["year"] == 1952) | (gapminder["year"] == 2007)].groupby(by = ["continent","year"])["pop"].sum()

In [None]:
Pop_increase_by_continent
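The grouped totals still have to be turned into an increase per continent. One possible sketch, on a made-up `toy` DataFrame that only mimics the gapminder column names, moves the year level into columns with unstack() and subtracts the two year columns.

```python
import pandas as pd

# Hypothetical stand-in for the gapminder data
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "year": [1952, 1952, 2007, 2007, 1952, 2007],
    "pop": [100, 50, 300, 150, 80, 120],
})

totals = toy.groupby(["continent", "year"])["pop"].sum()

# unstack() moves the 'year' level into columns: one row per continent
wide = totals.unstack("year")
increase = (wide[2007] - wide[1952]).sort_values(ascending=False)

print(increase)
print(increase.idxmax())  # Asia
```

The same subtraction avoids typing the totals by hand, as was done for the worldwide increase.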

## **Average life expectancy by continent in 1952 and 2007**

What should be the grouping variable(s) for answering this question?

In [74]:
avg_LifeExp = gapminder[(gapminder["year"] == 1952)| (gapminder["year"] == 2007)].groupby(by = ["continent","year"])["lifeExp"].mean()

In [75]:
avg_LifeExp

continent  year
Africa     1952    39.135500
           2007    54.806038
Americas   1952    53.279840
           2007    73.608120
Asia       1952    46.314394
           2007    70.728485
Europe     1952    64.408500
           2007    77.648600
Oceania    1952    69.255000
           2007    80.719500
Name: lifeExp, dtype: float64

## **Continent with the greatest life expectancy increase between 1952 and 2007**

In [None]:
df = pd.DataFrame(avg_LifeExp).pivot_table(index = "continent", columns = "year")["lifeExp"]

In [None]:
year_1952 = pd.DataFrame(avg_LifeExp).pivot_table(index = "continent", columns = "year")["lifeExp"][1952]

In [None]:
year_2007 = pd.DataFrame(avg_LifeExp).pivot_table(index = "continent", columns = "year")["lifeExp"][2007]

In [None]:
df["LifeExp_increase"] = year_2007-year_1952

In [None]:
df.sort_values(by = "LifeExp_increase", ascending = False)

## **Top five countries with highest life expectancy in each continent in 2007**

In [76]:
top_five_countries = gapminder[gapminder["year"] ==2007].groupby(by = "continent")["lifeExp"].apply(lambda x: pd.Series(x.sort_values(ascending = False).head()))

In [77]:
top_five_countries

continent      
Africa     1271    76.442
           911     73.952
           1571    73.923
           983     72.801
           35      72.301
Americas   251     80.653
           359     78.782
           1259    78.746
           287     78.553
           395     78.273
Asia       803     82.603
           671     82.208
           767     80.745
           1367    79.972
           851     78.623
Europe     695     81.757
           1487    81.701
           1427    80.941
           1475    80.884
           539     80.657
Oceania    71      81.235
           1103    80.204
Name: lifeExp, dtype: float64

In [78]:
gapminder[gapminder["lifeExp"].isin(top_five_countries)].sort_values(by = ["continent", "lifeExp"], ascending = [True, False])

          country  continent  year  lifeExp       pop    gdpPercap
1271      Reunion     Africa  2007   76.442    798094  7670.122558
911         Libya     Africa  2007   73.952   6036914  12057.49928
1571      Tunisia     Africa  2007   73.923  10276158  7092.923025
983     Mauritius     Africa  2007   72.801   1250882  10956.99112
35        Algeria     Africa  2007   72.301  33333216  6223.367465
251        Canada   Americas  2007   80.653  33390141  36319.23501
359    Costa Rica   Americas  2007   78.782   4133884   9645.06142
1259  Puerto Rico   Americas  2007   78.746   3942491  19328.70901
287         Chile   Americas  2007   78.553  16284741  13171.63885
395          Cuba   Americas  2007   78.273  11416987  8948.102923


### *Another way to do the same thing*

In [None]:
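# Each index entry of top_five_countries is a (continent, row label) pair
multi_index_list = top_five_countries.index.tolist()
top_five_index = [row_label for continent, row_label in multi_index_list]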

In [None]:
gapminder.loc[top_five_index]
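A third option, sketched here on a made-up `toy` DataFrame that only mimics the gapminder column names, is to sort once and keep the first n rows of each group with head(). This keeps all columns, so no second lookup is needed.

```python
import pandas as pd

# Hypothetical stand-in for the 2007 slice of the gapminder data
toy = pd.DataFrame({
    "country": ["P", "Q", "R", "S", "T", "U"],
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "lifeExp": [70.1, 82.6, 75.0, 81.7, 79.4, 80.9],
})

# Sort by continent, then by life expectancy (highest first),
# and keep the first 2 rows of each continent (use 5 for a top five)
top_n = (
    toy.sort_values(["continent", "lifeExp"], ascending=[True, False])
       .groupby("continent")
       .head(2)
)

print(top_n)
```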

### *Application of apply() method to Pandas series*

- Convert the population series (column) into millions by dividing each value by 1,000,000.
- Convert the continent names to uppercase.

In [79]:
gapminder["pop"].apply(lambda x : x/1000000) ## to keep the change, assign the result back to the same column

0        8.425333
1        9.240934
2       10.267083
3       11.537966
4       13.079460
          ...    
1699     9.216418
1700    10.704340
1701    11.404948
1702    11.926563
1703    12.311143
Name: pop, Length: 1704, dtype: float64

In [None]:
gapminder["continent"].apply(lambda x: x.upper())
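For simple element-wise operations like these, apply() is not strictly required. As a hedged alternative, the sketch below uses plain arithmetic and the .str accessor on a made-up `toy` DataFrame; the column name `pop_millions` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-in with the same column names as the gapminder data
toy = pd.DataFrame({
    "continent": ["Asia", "Europe"],
    "pop": [8425333, 1282697],
})

# Arithmetic on a Series is applied element-wise without apply()
toy["pop_millions"] = toy["pop"] / 1_000_000

# The .str accessor exposes string methods for a whole column
toy["continent"] = toy["continent"].str.upper()

print(toy)
```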