
In [211]:
import pandas as pd
import numpy as np

In [212]:
!gdown "1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_"

Downloading...
From: https://drive.google.com/uc?id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_
To: /content/mckinsey.csv
  0% 0.00/83.8k [00:00<?, ?B/s]100% 83.8k/83.8k [00:00<00:00, 100MB/s]


In [213]:
df = pd.read_csv('mckinsey.csv')
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


### **df.set_index()**

The **`df.set_index()`** method in Pandas is used to set one or more columns as the index of a DataFrame. Here's an explanation of its functionality:

- **Method**: `df.set_index()`
- **Input**:
  - `keys`: The column(s) you want to set as the index. It can be a single column name or a list of column names.
  - Additional parameters are available to control the behavior of the operation, such as `drop`, `append`, `inplace`, and `verify_integrity`.
- **Output**: Returns a new DataFrame with the specified column(s) set as the index, or `None` when `inplace=True`.
- **Usage**:
  - Typically used when you want to change the index of a DataFrame to a column or a combination of columns that better represent the data.
- **Parameters**:
  - `keys`: The column(s) to set as the index.
  - `drop`: If set to True, removes the column(s) used as the new index from the DataFrame's columns. Default is True.
  - `append`: If set to True, appends the new index to the existing index, creating a MultiIndex. Default is False.
  - `inplace`: If set to True, modifies the DataFrame in place and returns None. Default is False.
  - `verify_integrity`: If set to True, checks the new index for duplicates and raises an error if duplicates are found. Default is False.
- **Example**:
  ```python
  import pandas as pd

  # Create a DataFrame
  df = pd.DataFrame({
      'A': [1, 2, 3],
      'B': ['x', 'y', 'z']
  })

  # Set column 'B' as the index
  new_df = df.set_index('B')

  print(new_df)
  ```
  Output:
  ```
     A
  B   
  x  1
  y  2
  z  3
  ```
- **Note**:
  - By default, `df.set_index()` returns a new DataFrame with the specified column(s) set as the index, while the original DataFrame remains unchanged. If you want to modify the original DataFrame in place, you can use the `inplace=True` parameter.
  - If the specified index column(s) contain duplicate values and `verify_integrity=True`, an error will be raised.
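
The `drop`, multi-column `keys`, and `verify_integrity` options described above can be sketched as follows. The `toy` frame and its values are hypothetical and are not taken from `mckinsey.csv`; this is an illustration under those assumptions rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data (not from mckinsey.csv)
toy = pd.DataFrame({
    'country': ['India', 'India', 'Japan'],
    'year': [2002, 2007, 2007],
    'life_exp': [62.9, 64.7, 82.6],
})

# drop=False keeps 'country' as a regular column as well as the index
kept = toy.set_index('country', drop=False)

# A list of keys builds a MultiIndex from several columns
multi = toy.set_index(['country', 'year'])

# verify_integrity=True raises ValueError because 'India' appears twice
try:
    toy.set_index('country', verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(kept.columns.tolist())
print(list(multi.index.names))
print(type(duplicate_error).__name__)
```

Leaving `verify_integrity` at its default of False is what allows the notebook to use `continent` and `country`, both full of repeated values, as an index.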

In [214]:
import pandas as pd

# Create a DataFrame
dff = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['x', 'y', 'z']
})

# Set column 'B' as the index
new_df = dff.set_index('B')

print(new_df)

   A
B   
x  1
y  2
z  3


In [215]:
temp = df.set_index('continent')
temp

Unnamed: 0_level_0,country,year,population,life_exp,gdp_cap
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Asia,Afghanistan,1952,8425333,28.801,779.445314
Asia,Afghanistan,1957,9240934,30.332,820.853030
Asia,Afghanistan,1962,10267083,31.997,853.100710
Asia,Afghanistan,1967,11537966,34.020,836.197138
Asia,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
Africa,Zimbabwe,1987,9216418,62.351,706.157306
Africa,Zimbabwe,1992,10704340,60.377,693.420786
Africa,Zimbabwe,1997,11404948,46.809,792.449960
Africa,Zimbabwe,2002,11926563,39.989,672.038623


In [216]:
temp.loc['Asia']

Unnamed: 0_level_0,country,year,population,life_exp,gdp_cap
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Asia,Afghanistan,1952,8425333,28.801,779.445314
Asia,Afghanistan,1957,9240934,30.332,820.853030
Asia,Afghanistan,1962,10267083,31.997,853.100710
Asia,Afghanistan,1967,11537966,34.020,836.197138
Asia,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
Asia,"Yemen, Rep.",1987,11219340,52.922,1971.741538
Asia,"Yemen, Rep.",1992,13367997,55.599,1879.496673
Asia,"Yemen, Rep.",1997,15826497,58.020,2117.484526
Asia,"Yemen, Rep.",2002,18701257,60.308,2234.820827


In [217]:
temp.loc['Asia'].index.value_counts()

Asia    396
Name: continent, dtype: int64

In [218]:
a = df.set_index('country')
a

Unnamed: 0_level_0,year,population,continent,life_exp,gdp_cap
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,1952,8425333,Asia,28.801,779.445314
Afghanistan,1957,9240934,Asia,30.332,820.853030
Afghanistan,1962,10267083,Asia,31.997,853.100710
Afghanistan,1967,11537966,Asia,34.020,836.197138
Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...
Zimbabwe,1987,9216418,Africa,62.351,706.157306
Zimbabwe,1992,10704340,Africa,60.377,693.420786
Zimbabwe,1997,11404948,Africa,46.809,792.449960
Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [219]:
a.loc['India']


Unnamed: 0_level_0,year,population,continent,life_exp,gdp_cap
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
India,1952,372000000,Asia,37.373,546.565749
India,1957,409000000,Asia,40.249,590.061996
India,1962,454000000,Asia,43.605,658.347151
India,1967,506000000,Asia,47.193,700.770611
India,1972,567000000,Asia,50.651,724.032527
India,1977,634000000,Asia,54.208,813.337323
India,1982,708000000,Asia,56.596,855.723538
India,1987,788000000,Asia,58.553,976.512676
India,1992,872000000,Asia,60.223,1164.406809
India,1997,959000000,Asia,61.765,1458.817442


In [220]:
a.loc['India'].index.value_counts()

India    12
Name: country, dtype: int64

### **df.reset_index()**

The **`df.reset_index()`** method in Pandas is used to reset the index of a DataFrame back to the default integer index. Here's an explanation of its functionality:

- **Method**: `df.reset_index()`
- **Input**:
  - Parameters are available to control the behavior of the operation, such as `level`, `drop`, `inplace`, and `col_level`.
- **Output**: Returns a new DataFrame with the index reset to the default integer index.
- **Usage**:
  - Typically used when you want to revert the index of a DataFrame to the default integer index after setting a different index.
- **Parameters**:
  - `level`: Specifies the level(s) of the index to be reset. If the index is a MultiIndex, this parameter can be used to reset only a specific level. Default is None, which resets all levels.
  - `drop`: If set to True, drops the current index instead of adding it as a new column in the DataFrame. Default is False.
  - `inplace`: If set to True, modifies the DataFrame in place and returns None. Default is False.
  - `col_level`: If the columns have multiple levels, specifies the level into which the former index labels are inserted. Default is 0 (the first level).
- **Example**:
  ```python
  import pandas as pd

  # Create a DataFrame with a custom index
  df = pd.DataFrame({
      'A': [1, 2, 3],
      'B': ['x', 'y', 'z']
  }, index=['a', 'b', 'c'])

  # Reset the index
  new_df = df.reset_index()

  print(new_df)
  ```
  Output:
  ```
    index  A  B
  0     a  1  x
  1     b  2  y
  2     c  3  z
  ```
- **Note**:
  - By default, `df.reset_index()` returns a new DataFrame with the index reset to the default integer index, while the original DataFrame remains unchanged. If you want to modify the original DataFrame in place, you can use the `inplace=True` parameter.
  - If the DataFrame has a MultiIndex, you can specify the `level` parameter to reset only a specific level of the index.
  - If you don't want to keep the current index as a column in the DataFrame, you can set `drop=True`.
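
The `level` and `drop` notes above can be illustrated with a small MultiIndex frame. This is a minimal sketch; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame indexed by (country, year)
toy = pd.DataFrame(
    {'life_exp': [62.9, 64.7, 82.6]},
    index=pd.MultiIndex.from_tuples(
        [('India', 2002), ('India', 2007), ('Japan', 2007)],
        names=['country', 'year'],
    ),
)

# Move only the 'year' level back into a column; 'country' stays as the index
partial = toy.reset_index(level='year')

# drop=True discards the index entirely instead of turning it into columns
dropped = toy.reset_index(drop=True)

print(partial.columns.tolist())  # ['year', 'life_exp']
print(partial.index.name)        # country
print(dropped.columns.tolist())  # ['life_exp']
```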

In [221]:
import pandas as pd

# Create a DataFrame with a custom index
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['x', 'y', 'z']
}, index=['a', 'b', 'c'])
data

Unnamed: 0,A,B
a,1,x
b,2,y
c,3,z


In [222]:
# Reset the index
new_df = data.reset_index(drop = True)

print(new_df)

   A  B
0  1  x
1  2  y
2  3  z


In [223]:
df.reset_index()

Unnamed: 0,index,country,year,population,continent,life_exp,gdp_cap
0,0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...,...
1699,1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [224]:
df.reset_index(drop = True)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [225]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [226]:
# Two ways to make a change permanent:
# 1. Reassign the result:  df = df.some_method(...)
# 2. Pass inplace = True (the method then returns None)

# Working on rows

In [227]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [228]:
df = df.reset_index(drop = True)
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [229]:
len(df.index)

1704

In [230]:
# Dropping of indices

# 1. Temporary drop
# df.drop(index = [row names/slice the rows that need to be dropped])
# print(df)


# 2. Permanent drop
# A. df.drop(index = [row names/slice the rows that need to be dropped], inplace = True)
# B. df = df.drop(index = [row names/slice the rows that need to be dropped])



In [231]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [232]:
df.drop(index = np.arange(100), inplace = False) # Returns a copy without the rows labelled 0 to 99; df itself is unchanged

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
100,Bangladesh,1972,70759295,Asia,45.252,630.233627
101,Bangladesh,1977,80428306,Asia,46.923,659.877232
102,Bangladesh,1982,93074406,Asia,50.009,676.981866
103,Bangladesh,1987,103764241,Asia,52.819,751.979403
104,Bangladesh,1992,113704579,Asia,56.018,837.810164
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [233]:
# df.drop(index = [:100])  # invalid: a bare slice such as [:100] is a SyntaxError inside a list
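
A bare slice cannot be written inside a list, so it cannot be passed to `drop` directly. A minimal sketch of two working alternatives, using a hypothetical ten-row frame in place of `df`:

```python
import pandas as pd

# Hypothetical frame with the default 0..9 index
toy = pd.DataFrame({'value': range(10)})

# Slice the index object to get the labels of the first three rows, then drop them
via_drop = toy.drop(index=toy.index[:3])

# Positional slicing keeps everything from position 3 onward
via_iloc = toy.iloc[3:]

print(via_drop.index.tolist())  # [3, 4, 5, 6, 7, 8, 9]
print(via_iloc.index.tolist())  # [3, 4, 5, 6, 7, 8, 9]
```

`toy.index[:3]` works with any index type, whereas `np.arange(n)` only matches when the row labels happen to be the integers 0 to n-1.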

In [234]:
df.iloc[100:]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
100,Bangladesh,1972,70759295,Asia,45.252,630.233627
101,Bangladesh,1977,80428306,Asia,46.923,659.877232
102,Bangladesh,1982,93074406,Asia,50.009,676.981866
103,Bangladesh,1987,103764241,Asia,52.819,751.979403
104,Bangladesh,1992,113704579,Asia,56.018,837.810164
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


In [235]:
len(df.iloc[100:].index.value_counts())

1604

In [236]:
new_row = {'country' : 'India', 'year' : '2000', 'population' : 13500000, 'life_exp' : 37.08, 'gdp_cap' : 900.23}
new_row

# new_row = {'country' : 'India', 'year' : '2000', 'population' : 1213748, 'life_exp' : 382.23, 'gdp_cap' : 1873 }

{'country': 'India',
 'year': '2000',
 'population': 13500000,
 'life_exp': 37.08,
 'gdp_cap': 900.23}

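### **df.append()**

The `df.append()` method in Pandas is used to append rows of one DataFrame to another DataFrame. It was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells below run only on older versions; `pd.concat()` is the replacement. Here's an explanation of its functionality: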

- **Method**: `df.append()`
- **Input**:
  - The DataFrame or Series you want to append to another DataFrame.
  - Parameters are available to control the behavior of the operation, such as `ignore_index`, `verify_integrity`, and `sort`.
- **Output**: Returns a new DataFrame with the rows appended from the specified DataFrame or Series.
- **Usage**:
  - Typically used when you want to combine two DataFrames vertically by adding rows from one DataFrame to the end of another.
- **Parameters**:
  - `other`: The DataFrame or Series to append to the original DataFrame.
  - `ignore_index`: If set to True, resets the index of the appended DataFrame to have a continuous integer index. Default is False.
  - `verify_integrity`: If set to True, checks whether the index of the appended DataFrame is unique and raises an error if duplicates are found. Default is False.
  - `sort`: If set to True, sorts the columns of the result when the columns of the two DataFrames are not aligned. Default is False.
- **Example**:
  ```python
  import pandas as pd

  # Create two DataFrames
  df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
  df2 = pd.DataFrame({'A': [3, 4], 'B': ['z', 'w']})

  # Append df2 to df1
  new_df = df1.append(df2)

  print(new_df)
  ```
  Output:
  ```
     A  B
  0  1  x
  1  2  y
  0  3  z
  1  4  w
  ```
- **Note**:
  - `df.append()` always returns a new DataFrame with the rows appended, while the original DataFrame remains unchanged. It has no `inplace` parameter, so assign the result back (`df = df.append(...)`) to keep the change.
  - If the index of the result would contain duplicate values and `verify_integrity=True`, an error will be raised.
  - If the column names of the appended DataFrame differ from the column names of the original DataFrame, the columns will be aligned based on their names and missing values are filled with NaN. Use the `sort=True` parameter to sort the columns of the result.
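
Because `DataFrame.append` no longer exists in pandas 2.0 and later, the same row-wise stacking is usually written with `pd.concat`. A minimal sketch, reusing the two toy frames from the example above:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3, 4], 'B': ['z', 'w']})

# Counterpart of df1.append(df2): the original row labels are kept
stacked = pd.concat([df1, df2])

# Counterpart of df1.append(df2, ignore_index=True): rows are relabelled 0..3
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Like `append`, `pd.concat` returns a new DataFrame and leaves `df1` and `df2` untouched.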

In [264]:
# Adding/appending new_row
df.append(new_row)

  df.append(new_row)


TypeError: Can only append a dict if ignore_index=True

In [238]:
print(df.append(new_row, ignore_index = True)) # Returns a new DataFrame; df itself is not changed

print()

print(len(df.append(new_row, ignore_index = True)))

          country  year  population continent  life_exp     gdp_cap
0     Afghanistan  1952     8425333      Asia    28.801  779.445314
1     Afghanistan  1957     9240934      Asia    30.332  820.853030
2     Afghanistan  1962    10267083      Asia    31.997  853.100710
3     Afghanistan  1967    11537966      Asia    34.020  836.197138
4     Afghanistan  1972    13079460      Asia    36.088  739.981106
...           ...   ...         ...       ...       ...         ...
1700     Zimbabwe  1992    10704340    Africa    60.377  693.420786
1701     Zimbabwe  1997    11404948    Africa    46.809  792.449960
1702     Zimbabwe  2002    11926563    Africa    39.989  672.038623
1703     Zimbabwe  2007    12311143    Africa    43.487  469.709298
1704        India  2000    13500000       NaN    37.080  900.230000

[1705 rows x 6 columns]

1705




In [239]:
len(df.index)

1704
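
The length is still 1704 because appending returns a new DataFrame and leaves `df` as it was. On pandas versions without `append`, one way to add a dict as a row is to wrap it in a one-row DataFrame and concatenate. The sketch below uses a hypothetical `toy` frame and a hypothetical `new_row`; it is one possible approach, not the notebook's own method.

```python
import pandas as pd

# Hypothetical stand-in for df
toy = pd.DataFrame({
    'country': ['Japan', 'Rwanda'],
    'year': [2007, 1992],
    'continent': ['Asia', 'Africa'],
})

# Hypothetical row with no 'continent' key
new_row = {'country': 'India', 'year': 2000}

# Wrap the dict in a one-row DataFrame, then concatenate and renumber
extended = pd.concat([toy, pd.DataFrame([new_row])], ignore_index=True)

print(len(toy))       # 2, the original is unchanged
print(len(extended))  # 3
print(pd.isna(extended.loc[2, 'continent']))  # True, the missing key becomes NaN
```

To keep the new row, the result has to be assigned back, for example `toy = pd.concat(...)`.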

In [240]:
df.loc[1702] = ["India", 2012, 13500000, 'Asia', 37.080, 905.2] # Modifying or Updating the rows

In [241]:
df # compare the row labelled 1702 with its earlier values

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,India,2012,13500000,Asia,37.080,905.200000


In [242]:
len(df.index)

1704

In [243]:
# Modifying/Updating the Rows.

df.loc[2] = ['Afghanistan',1957, 9240934, 'Asia', 30.332+5, 820.853030]

In [244]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1957,9240934,Asia,35.332,820.853030
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,India,2012,13500000,Asia,37.080,905.200000


In [245]:
# Drop the row labelled 2 permanently (inplace = True)

df.drop(index = [2], inplace = True)

In [246]:
df # row with label 2 is dropped

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
5,Afghanistan,1977,14880372,Asia,38.438,786.113360
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,India,2012,13500000,Asia,37.080,905.200000


In [247]:
df.drop([0,3,5], inplace = True)
df # rows with labels 0,3,5 are dropped

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
6,Afghanistan,1982,12881816,Asia,39.854,978.011439
7,Afghanistan,1987,13867957,Asia,40.822,852.395945
8,Afghanistan,1992,16317921,Asia,41.674,649.341395
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,India,2012,13500000,Asia,37.080,905.200000


In [248]:
df = df.reset_index(drop = True)
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
1,Afghanistan,1972,13079460,Asia,36.088,739.981106
2,Afghanistan,1982,12881816,Asia,39.854,978.011439
3,Afghanistan,1987,13867957,Asia,40.822,852.395945
4,Afghanistan,1992,16317921,Asia,41.674,649.341395
...,...,...,...,...,...,...
1695,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1696,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1697,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1698,India,2012,13500000,Asia,37.080,905.200000


In [249]:
len(df.index)

1700

In [250]:
df.loc[len(df.index)] = ['India',2000,13500000, "Asia",37.08,900.23]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,"Asia",80.00,500.00]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,"Asia",80.00,500.00]
df.loc[len(df.index)] = ['India',2000 ,13500000,"Asia",80.00,900.23]
df


# We added 4 new rows (labels 1700 to 1703)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
1,Afghanistan,1972,13079460,Asia,36.088,739.981106
2,Afghanistan,1982,12881816,Asia,39.854,978.011439
3,Afghanistan,1987,13867957,Asia,40.822,852.395945
4,Afghanistan,1992,16317921,Asia,41.674,649.341395
...,...,...,...,...,...,...
1699,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1700,India,2000,13500000,Asia,37.080,900.230000
1701,Sri Lanka,2022,130000000,Asia,80.000,500.000000
1702,Sri Lanka,2022,130000000,Asia,80.000,500.000000


### **df.duplicated()**

The `df.duplicated()` method in Pandas is used to identify duplicate rows in a DataFrame. Here's an explanation of its functionality:

- **Method**: `df.duplicated()`
- **Input**:
  - Parameters are available to control the behavior of the operation, such as `subset` and `keep`.
- **Output**: Returns a Boolean Series indicating whether each row in the DataFrame is a duplicate of a previous row.
- **Usage**:
  - Typically used when you want to identify and remove duplicate rows from a DataFrame, or perform other operations based on the presence of duplicates.
- **Parameters**:
  - `subset`: Specifies the subset of columns to consider when identifying duplicates. By default, all columns are used.
  - `keep`: Specifies how to mark duplicates. Possible values are:
    - `'first'`: Mark duplicates as True except for the first occurrence (default).
    - `'last'`: Mark duplicates as True except for the last occurrence.
    - `False`: Mark all duplicates as True.
- **Example**:
  ```python
  import pandas as pd

  # Create a DataFrame
  df = pd.DataFrame({
      'A': [1, 2, 3, 1, 2],
      'B': ['x', 'y', 'z', 'x', 'y']
  })

  # Check for duplicate rows
  duplicates = df.duplicated()

  print(duplicates)
  ```
  Output:
  ```
  0    False
  1    False
  2    False
  3     True
  4     True
  dtype: bool
  ```
- **Note**:
  - By default, `df.duplicated()` returns a Boolean Series where True indicates that the corresponding row is a duplicate of a previous row.
  - You can use this method in conjunction with other methods like `df.drop_duplicates()` to remove duplicate rows from the DataFrame.
  - If you want to consider only a subset of columns when identifying duplicates, you can specify the `subset` parameter.
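
The effect of `keep` and `subset` is easiest to see side by side. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are exact copies; rows 0, 1, 3 share country and year
toy = pd.DataFrame({
    'country': ['India', 'India', 'Japan', 'India'],
    'year': [2000, 2000, 2007, 2000],
    'life_exp': [37.08, 80.0, 82.6, 37.08],
})

# Default keep='first': only the later exact copy is flagged
first = toy.duplicated()

# keep=False flags every member of a duplicate group
all_members = toy.duplicated(keep=False)

# subset compares only the listed columns
by_key = toy.duplicated(subset=['country', 'year'])

print(first.tolist())        # [False, False, False, True]
print(all_members.tolist())  # [True, False, False, True]
print(by_key.tolist())       # [False, True, False, True]
```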

In [251]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702     True
1703    False
Length: 1704, dtype: bool

In [255]:
## extracting duplicated rows.
df.loc[df.duplicated()]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1702,Sri Lanka,2022,130000000,Asia,80.0,500.0


In [256]:
len(df.loc[df.duplicated()])

1

### **df.drop_duplicates()**

The `df.drop_duplicates()` method in Pandas is used to remove duplicate rows from a DataFrame. Here's an explanation of its functionality:

- **Method**: `df.drop_duplicates()`
- **Input**:
  - Parameters are available to control the behavior of the operation, such as `subset`, `keep`, and `inplace`.
- **Output**: Returns a new DataFrame with duplicate rows removed.
- **Usage**:
  - Typically used when you want to eliminate duplicate rows from a DataFrame, keeping only the first occurrence or the last occurrence of each unique row.
- **Parameters**:
  - `subset`: Specifies the subset of columns to consider when identifying duplicates. By default, all columns are used.
  - `keep`: Specifies which occurrence(s) of duplicates to keep. Possible values are:
    - `'first'`: Keep the first occurrence of each duplicate row (default behavior).
    - `'last'`: Keep the last occurrence of each duplicate row.
    - `False`: Drop all duplicate rows.
  - `inplace`: If set to True, modifies the DataFrame in place and returns None. Default is False.
- **Example**:
  ```python
  import pandas as pd

  # Create a DataFrame with duplicate rows
  df = pd.DataFrame({
      'A': [1, 2, 3, 1, 2],
      'B': ['x', 'y', 'z', 'x', 'y']
  })

  # Remove duplicate rows
  new_df = df.drop_duplicates()

  print(new_df)
  ```
  Output:
  ```
     A  B
  0  1  x
  1  2  y
  2  3  z
  ```
- **Note**:
  - By default, `df.drop_duplicates()` returns a new DataFrame with duplicate rows removed, while the original DataFrame remains unchanged. If you want to modify the original DataFrame in place, you can use the `inplace=True` parameter.
  - You can specify the `subset` parameter to consider only a subset of columns when identifying duplicates.
  - The `keep` parameter allows you to specify whether to keep the first occurrence, the last occurrence, or drop all occurrences of duplicates.
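
The three `keep` settings and the `subset` parameter can be compared on one small frame. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are exact copies; rows 0, 1, 3 share country and year
toy = pd.DataFrame({
    'country': ['India', 'India', 'Japan', 'India'],
    'year': [2000, 2000, 2007, 2000],
    'life_exp': [37.08, 80.0, 82.6, 37.08],
})

# keep='last' retains the final copy of each duplicated row
last_kept = toy.drop_duplicates(keep='last')

# keep=False removes every row that has a duplicate
none_kept = toy.drop_duplicates(keep=False)

# subset judges duplicates on the listed columns only; reset_index renumbers the rows
by_key = toy.drop_duplicates(subset=['country', 'year']).reset_index(drop=True)

print(last_kept.index.tolist())  # [1, 2, 3]
print(none_kept.index.tolist())  # [1, 2]
print(by_key['country'].tolist())  # ['India', 'Japan']
```

The surviving rows keep their original labels, which is why a `reset_index(drop=True)` often follows.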

In [257]:
df.drop_duplicates()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
1,Afghanistan,1972,13079460,Asia,36.088,739.981106
2,Afghanistan,1982,12881816,Asia,39.854,978.011439
3,Afghanistan,1987,13867957,Asia,40.822,852.395945
4,Afghanistan,1992,16317921,Asia,41.674,649.341395
...,...,...,...,...,...,...
1698,India,2012,13500000,Asia,37.080,905.200000
1699,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1700,India,2000,13500000,Asia,37.080,900.230000
1701,Sri Lanka,2022,130000000,Asia,80.000,500.000000


In [258]:
len(df.drop_duplicates())

1703

In [260]:
# df.drop_duplicates(subset=['country', 'year', 'population']) treats rows as duplicates
# when they share the same values in 'country', 'year', and 'population',
# and keeps only the first row of each such combination.


df.drop_duplicates(subset = ['country', 'year', 'population'])

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
1,Afghanistan,1972,13079460,Asia,36.088,739.981106
2,Afghanistan,1982,12881816,Asia,39.854,978.011439
3,Afghanistan,1987,13867957,Asia,40.822,852.395945
4,Afghanistan,1992,16317921,Asia,41.674,649.341395
...,...,...,...,...,...,...
1697,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1698,India,2012,13500000,Asia,37.080,905.200000
1699,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1700,India,2000,13500000,Asia,37.080,900.230000


# Working with rows and columns

In [261]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
1,Afghanistan,1972,13079460,Asia,36.088,739.981106
2,Afghanistan,1982,12881816,Asia,39.854,978.011439
3,Afghanistan,1987,13867957,Asia,40.822,852.395945
4,Afghanistan,1992,16317921,Asia,41.674,649.341395
...,...,...,...,...,...,...
1699,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1700,India,2000,13500000,Asia,37.080,900.230000
1701,Sri Lanka,2022,130000000,Asia,80.000,500.000000
1702,Sri Lanka,2022,130000000,Asia,80.000,500.000000


In [262]:
df.iloc[0:5, 0:3]

Unnamed: 0,country,year,population
0,Afghanistan,1957,9240934
1,Afghanistan,1972,13079460
2,Afghanistan,1982,12881816
3,Afghanistan,1987,13867957
4,Afghanistan,1992,16317921


In [265]:
df.loc[1:5, 1:4] # fails: .loc is label-based and the column labels are strings, so an integer slice such as 1:4 is not valid for the columns

TypeError: cannot do slice indexing on Index with these indexers [1] of type int
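
The error comes from mixing the two indexers: `iloc` takes integer positions and excludes the end of a slice, while `loc` takes labels and includes it. A minimal sketch on a hypothetical frame whose row labels deliberately differ from the positions:

```python
import pandas as pd

# Hypothetical frame; row labels start at 10 so labels and positions differ
toy = pd.DataFrame(
    {
        'country': ['A', 'B', 'C', 'D'],
        'year': [1952, 1957, 1962, 1967],
        'population': [10, 20, 30, 40],
    },
    index=[10, 11, 12, 13],
)

# iloc: positions, end excluded -> rows at positions 1 and 2, columns at positions 0 and 1
by_position = toy.iloc[1:3, 0:2]

# loc: labels, end included -> rows labelled 11 to 13, columns 'country' to 'year'
by_label = toy.loc[11:13, 'country':'year']

print(by_position.index.tolist())  # [11, 12]
print(by_label.index.tolist())     # [11, 12, 13]
print(by_label.columns.tolist())   # ['country', 'year']
```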

In [266]:
df.loc[1:5, 'country' : 'population'] # This will print country, year, population columns and 1,2,3,4,5 rows

Unnamed: 0,country,year,population
1,Afghanistan,1972,13079460
2,Afghanistan,1982,12881816
3,Afghanistan,1987,13867957
4,Afghanistan,1992,16317921
5,Afghanistan,1997,22227415


In [267]:
df.loc[1:4, ['life_exp', 'country']] # This will print only life_exp and country columns and 1,2,3,4 rows

Unnamed: 0,life_exp,country
1,36.088,Afghanistan
2,39.854,Afghanistan
3,40.822,Afghanistan
4,41.674,Afghanistan


In [268]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
1,Afghanistan,1972,13079460,Asia,36.088,739.981106
2,Afghanistan,1982,12881816,Asia,39.854,978.011439
3,Afghanistan,1987,13867957,Asia,40.822,852.395945
4,Afghanistan,1992,16317921,Asia,41.674,649.341395
...,...,...,...,...,...,...
1699,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1700,India,2000,13500000,Asia,37.080,900.230000
1701,Sri Lanka,2022,130000000,Asia,80.000,500.000000
1702,Sri Lanka,2022,130000000,Asia,80.000,500.000000


In [269]:
df.iloc[0:10:2, -3:]

Unnamed: 0,continent,life_exp,gdp_cap
0,Asia,30.332,820.85303
2,Asia,39.854,978.011439
4,Asia,41.674,649.341395
6,Asia,42.129,726.734055
8,Europe,55.23,1601.056136


In [271]:
# df.fillna(-1)
# df['column_name'].fillna(____)
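
The two `fillna` patterns in the comments above can be sketched as follows. The `toy` frame is hypothetical, and filling with the column mean is only one possible choice of replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with one missing value per column
toy = pd.DataFrame({
    'life_exp': [37.08, np.nan, 60.4],
    'gdp_cap': [900.23, 500.0, np.nan],
})

# Replace every missing value in the frame with the sentinel -1
filled_all = toy.fillna(-1)

# Fill a single column, here with that column's mean (NaN is skipped when averaging)
filled_col = toy['life_exp'].fillna(toy['life_exp'].mean())

print(filled_all)
print(filled_col)
```

Both calls return new objects; `toy` still contains its NaN values unless the result is assigned back.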

In [270]:
df['life_exp']

0       30.332
1       36.088
2       39.854
3       40.822
4       41.674
         ...  
1699    43.487
1700    37.080
1701    80.000
1702    80.000
1703    80.000
Name: life_exp, Length: 1704, dtype: float64

In [272]:
#Finding minimum value
df['life_exp'].min()

23.599

In [273]:
# Finding maximum values
df['life_exp'].max()

82.603

In [274]:
# Finding average values
df['life_exp'].mean()

59.55713596244132

In [275]:
df['life_exp'].sum()

101485.35968000001

In [276]:
print(df['life_exp'].count())
print(len(df.index))

1704
1704
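
`count()` and `len(df.index)` agree here because `life_exp` has no missing values. A minimal sketch on a hypothetical Series shows how they diverge once a NaN is present, since `count`, `mean`, and the other reductions skip NaN by default:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series([30.0, np.nan, 80.0])

print(s.count())  # 2, non-missing values only
print(len(s))     # 3, every row
print(s.mean())   # 55.0, the NaN is ignored

# Several reductions at once
summary = s.agg(['min', 'max', 'mean'])
print(summary)
```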


### **df.sort_values()**

The `df.sort_values()` method in Pandas is used to sort the rows of a DataFrame by the values in one or more columns. Here's an explanation of its functionality:

- **Method**: `df.sort_values()`
- **Input**:
  - Parameters include `by`, `axis`, `ascending`, `inplace`, `na_position`, `ignore_index`, and `key`.
  - `by`: Specifies the column(s) to sort by. It can be a single column name or a list of column names.
- **Output**: Returns a new DataFrame with the rows sorted by the specified column(s).
- **Usage**:
  - Typically used when you want to arrange the rows of a DataFrame in a specific order based on the values in one or more columns.
- **Parameters**:
  - `by`: Specifies the column(s) to sort by. It can be a single column name or a list of column names.
  - `axis`: Specifies whether to sort rows (`axis=0`) or columns (`axis=1`). Default is 0 (rows).
  - `ascending`: Specifies whether to sort in ascending order (True) or descending order (False). Default is True.
  - `inplace`: If set to True, modifies the DataFrame in place and returns None. Default is False.
  - `na_position`: Specifies where NaNs should be placed in the sorted result. Options are `'first'` or `'last'`. Default is `'last'`.
  - `ignore_index`: If set to True, resets the index of the resulting DataFrame to have a continuous integer index. Default is False.
  - `key`: Specifies a function to be applied to the values before sorting.
- **Example**:
  ```python
  import pandas as pd

  # Create a DataFrame
  df = pd.DataFrame({
      'A': [2, 1, 3],
      'B': ['x', 'y', 'z']
  })

  # Sort the DataFrame by column 'A' in ascending order
  sorted_df = df.sort_values(by='A')

  print(sorted_df)
  ```
  Output:
  ```
     A  B
  1  1  y
  0  2  x
  2  3  z
  ```
- **Note**:
  - By default, `df.sort_values()` returns a new DataFrame with the rows sorted by the specified column(s), while the original DataFrame remains unchanged. If you want to modify the original DataFrame in place, you can use the `inplace=True` parameter.
  - You can specify multiple columns in the `by` parameter to perform a hierarchical sort based on multiple criteria.
  - The `ascending` parameter allows you to control the sorting order (ascending or descending).
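
The `na_position`, `key`, and per-column `ascending` options described above can be sketched on a small made-up frame. The column names and values below are illustrative assumptions, not part of the dataset used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the parameters
toy = pd.DataFrame({
    'name': ['bob', 'Alice', 'carol', 'Dave'],
    'team': ['red', 'green', 'red', 'blue'],
    'score': [2.5, np.nan, 1.0, 3.0],
})

# NaN values go last by default; na_position='first' moves them to the top
nan_first = toy.sort_values(by='score', na_position='first')

# key receives the whole column as a Series and must return a Series of the
# same length; lowercasing makes the sort case-insensitive
case_insensitive = toy.sort_values(by='name', key=lambda s: s.str.lower())

# One sort direction per column: teams A to Z, then highest score first
# within each team; ignore_index=True relabels the rows 0..n-1
mixed = toy.sort_values(by=['team', 'score'], ascending=[True, False],
                        ignore_index=True)

print(nan_first)
print(case_insensitive)
print(mixed)
```

Passing a list to `ascending` only works when it has the same length as the list given to `by`.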

In [278]:
df.sort_values(by = 'life_exp')

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1288,Rwanda,1992,7290203,Africa,23.599,737.068595
548,Gambia,1952,284320,Africa,30.000,485.230659
32,Angola,1952,4232095,Africa,30.015,3520.610273
1340,Sierra Leone,1952,2143249,Africa,30.331,879.787736
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
...,...,...,...,...,...,...
1483,Switzerland,2007,7554661,Europe,81.701,37506.419070
691,Iceland,2007,301931,Europe,81.757,36180.789190
798,Japan,2002,127065841,Asia,82.000,28604.591900
667,"Hong Kong, China",2007,6980412,Asia,82.208,39724.978670


In [279]:
df.sort_values(by = 'life_exp', ascending = False)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
799,Japan,2007,127467972,Asia,82.603,31656.068060
667,"Hong Kong, China",2007,6980412,Asia,82.208,39724.978670
798,Japan,2002,127065841,Asia,82.000,28604.591900
691,Iceland,2007,301931,Europe,81.757,36180.789190
1483,Switzerland,2007,7554661,Europe,81.701,37506.419070
...,...,...,...,...,...,...
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
1340,Sierra Leone,1952,2143249,Africa,30.331,879.787736
32,Angola,1952,4232095,Africa,30.015,3520.610273
548,Gambia,1952,284320,Africa,30.000,485.230659


In [280]:
df.sort_values(by = 'life_exp', ascending = True)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1288,Rwanda,1992,7290203,Africa,23.599,737.068595
548,Gambia,1952,284320,Africa,30.000,485.230659
32,Angola,1952,4232095,Africa,30.015,3520.610273
1340,Sierra Leone,1952,2143249,Africa,30.331,879.787736
0,Afghanistan,1957,9240934,Asia,30.332,820.853030
...,...,...,...,...,...,...
1483,Switzerland,2007,7554661,Europe,81.701,37506.419070
691,Iceland,2007,301931,Europe,81.757,36180.789190
798,Japan,2002,127065841,Asia,82.000,28604.591900
667,"Hong Kong, China",2007,6980412,Asia,82.208,39724.978670


In [129]:
'''df.sort_values(by=['year', 'life_exp'], ascending=False) sorts the DataFrame by the 'year' column in descending order
and, within each year, by the 'life_exp' column in descending order as well.'''

df.sort_values(by = ['year', 'life_exp'], ascending = False)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1701,Sri Lanka,2022,130000000,Asia,80.000,500.000000
1702,Sri Lanka,2022,130000000,Asia,80.000,500.000000
1698,India,2012,13500000,Asia,37.080,905.200000
799,Japan,2007,127467972,Asia,82.603,31656.068060
667,"Hong Kong, China",2007,6980412,Asia,82.208,39724.978670
...,...,...,...,...,...,...
188,Burkina Faso,1952,4469979,Africa,31.975,543.255241
1028,Mozambique,1952,6446316,Africa,31.286,468.526038
1340,Sierra Leone,1952,2143249,Africa,30.331,879.787736
32,Angola,1952,4232095,Africa,30.015,3520.610273


In [130]:
# by=['year', 'life_exp'] lists the sort columns in priority order.
# ascending=[False, True] gives one sort direction per column in `by`:
#   - 'year' is sorted in descending order, so the latest years come first;
#   - within each year, 'life_exp' is sorted in ascending order, so the lowest values come first.
df.sort_values(by = ['year','life_exp'], ascending = [False,True])

### **pd.concat()**

The `pd.concat()` function in Pandas is used to concatenate (i.e., join) two or more DataFrames along a particular axis. Here's an explanation of its functionality:

- **Function**: `pd.concat()`
- **Input**:
  - A sequence (e.g., list) of DataFrames or Series objects to concatenate.
  - Parameters such as `axis`, `join`, `ignore_index`, `keys`, `sort`, and `verify_integrity` can be used to control the behavior of the operation.
- **Output**: Returns a new concatenated DataFrame or Series.
- **Usage**:
  - Typically used when you want to combine multiple DataFrames or Series either vertically (along rows) or horizontally (along columns).
- **Parameters**:
  - `objs`: A sequence (e.g., list) of DataFrames or Series objects to concatenate.
  - `axis`: Specifies the axis along which the concatenation should be done. Default is 0 (rows). Use `axis=1` for concatenation along columns.
  - `join`: Specifies how to handle the labels on the other (non-concatenation) axis. Options are `'inner'` (intersection) and `'outer'` (union). Default is `'outer'`.
  - `ignore_index`: If set to True, resets the index of the resulting DataFrame to have a continuous integer index. Default is False.
  - `keys`: Creates a hierarchical index (MultiIndex) from the keys. Useful for identifying the source of each part of the concatenated DataFrame.
  - `sort`: Specifies whether to sort the non-concatenation axis (for example, the column labels when `axis=0`) if it is not already aligned across the inputs. Default is False.
  - `verify_integrity`: If set to True, checks whether the result contains duplicate index values and raises an error if duplicates are found. Default is False.
- **Example**:
  ```python
  import pandas as pd

  # Create two DataFrames
  df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
  df2 = pd.DataFrame({'A': [3, 4], 'B': ['z', 'w']})

  # Concatenate the two DataFrames along rows
  concatenated_df = pd.concat([df1, df2])

  print(concatenated_df)
  ```
  Output:
  ```
     A  B
  0  1  x
  1  2  y
  0  3  z
  1  4  w
  ```
- **Note**:
  - By default, `pd.concat()` concatenates along rows (`axis=0`). You can specify `axis=1` for concatenation along columns.
  - The `join` parameter specifies how to handle the labels on the non-concatenation axis: `'outer'` takes their union and `'inner'` takes their intersection. Unlike `pd.merge()`, `pd.concat()` has no `'left'` or `'right'` option.
  - If you want to combine DataFrames with different column names, use `join='outer'` to keep all columns from both DataFrames or `join='inner'` to keep only the common columns.
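
As a minimal sketch of the `join` and `keys` parameters described above, assuming two small made-up frames (`left` and `right` are hypothetical names) that share only column `A`:

```python
import pandas as pd

# Hypothetical frames that share only column 'A'
left = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
right = pd.DataFrame({'A': [3, 4], 'C': [True, False]})

# join='outer' keeps the union of the columns and fills the gaps with NaN
outer = pd.concat([left, right], join='outer', ignore_index=True)

# join='inner' keeps only the columns present in every input
inner = pd.concat([left, right], join='inner', ignore_index=True)

# keys adds an outer index level that records which input each row came from
labeled = pd.concat([left, right], keys=['left', 'right'])

print(outer)
print(inner)
print(labeled.loc['right'])
```

Selecting `labeled.loc['right']` recovers the rows that came from the second input.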

In [281]:
users = pd.DataFrame({'user_id' : [1,2,3], 'name' : ["sharadh", "shahid", "khusalli"]})
users

Unnamed: 0,user_id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [282]:
msgs = pd.DataFrame({'user_id' : [1,1,2,4], 'msg' : ['hmm', 'acha', 'theek hai', 'nice']})
msgs

Unnamed: 0,user_id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [283]:
pd.concat([users,msgs]) # by default axis = 0

Unnamed: 0,user_id,name,msg
0,1,sharadh,
1,2,shahid,
2,3,khusalli,
0,1,,hmm
1,1,,acha
2,2,,theek hai
3,4,,nice


In [284]:
pd.concat([users,msgs], ignore_index = True) # by default axis = 0

Unnamed: 0,user_id,name,msg
0,1,sharadh,
1,2,shahid,
2,3,khusalli,
3,1,,hmm
4,1,,acha
5,2,,theek hai
6,4,,nice


In [285]:
# pd.concat([users, msgs], ignore_index = True) concatenates users and msgs along the rows (axis=0).
# ignore_index=True discards the original indices of users and msgs and labels the result
# from 0 to n - 1, where n is the total number of rows in the concatenated DataFrame.
# This is useful when the original row labels carry no meaning.

In [286]:
pd.concat([users, msgs], ignore_index = True, axis = 0)

Unnamed: 0,user_id,name,msg
0,1,sharadh,
1,2,shahid,
2,3,khusalli,
3,1,,hmm
4,1,,acha
5,2,,theek hai
6,4,,nice


In [287]:
users

Unnamed: 0,user_id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [288]:
msgs

Unnamed: 0,user_id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [289]:
pd.concat([users,msgs], axis = 1)

Unnamed: 0,user_id,name,user_id.1,msg
0,1.0,sharadh,1,hmm
1,2.0,shahid,1,acha
2,3.0,khusalli,2,theek hai
3,,,4,nice


In [140]:
# axis = 0 concatenation is like UNION ALL in SQL (rows are stacked and duplicates are kept)
# axis = 1 concatenation places the two tables side by side, aligned on the index
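
To make the SQL analogy concrete, here is a small sketch with made-up frames (`q1` and `q2` are hypothetical names). Row-wise concatenation keeps repeated rows, so a SQL-style `UNION` additionally needs `drop_duplicates()`:

```python
import pandas as pd

# Hypothetical query results with one overlapping row
q1 = pd.DataFrame({'user_id': [1, 2]})
q2 = pd.DataFrame({'user_id': [2, 3]})

# Stacking rows keeps the repeated value 2, like UNION ALL
union_all = pd.concat([q1, q2], ignore_index=True)

# Removing repeated rows afterwards behaves like UNION
union = union_all.drop_duplicates(ignore_index=True)

print(union_all)
print(union)
```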


### **pd.merge**

The `pd.merge()` function in Pandas is used to merge two DataFrames based on one or more common columns or on their indexes. Here's an explanation of its functionality:

- **Function**: `pd.merge()`
- **Input**:
  - DataFrames to be merged.
  - Parameters such as `on`, `how`, `left_on`, `right_on`, `left_index`, `right_index`, `suffixes`, and `indicator` can be used to control the merging behavior.
- **Output**: Returns a new DataFrame with the result of the merge operation.
- **Usage**:
  - Typically used when you want to combine information from multiple DataFrames into a single DataFrame based on a common column or index.
- **Parameters**:
  - `left`: The left DataFrame to be merged.
  - `right`: The right DataFrame to be merged.
  - `on`: Column or index level names to join on. Must be found in both DataFrames. If not specified and the DataFrames have overlapping column names, it will merge on those columns.
  - `how`: Specifies the type of merge to perform. Options include `'inner'`, `'outer'`, `'left'`, and `'right'`. Default is `'inner'`.
  - `left_on`, `right_on`: Columns or index level names to join on in the left and right DataFrames, respectively.
  - `left_index`, `right_index`: If True, use the index (row labels) of the DataFrame as the join keys. Default is False.
  - `suffixes`: A tuple of string suffixes to apply to overlapping column names from the left and right DataFrames, respectively, in case of name conflicts. Default is `('_x', '_y')`.
  - `indicator`: Adds a special column `_merge` to the output DataFrame that indicates the source of each row. Useful for understanding the result of the merge operation.
- **Example**:
  ```python
  import pandas as pd

  # Create two DataFrames
  df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
  df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})

  # Merge the two DataFrames based on the 'key' column
  merged_df = pd.merge(df1, df2, on='key', how='inner')

  print(merged_df)
  ```
  Output:
  ```
    key  value_x  value_y
  0   A        1        4
  1   B        2        5
  ```
- **Note**:
  - By default, `pd.merge()` performs an inner join, which returns only the rows that have matching keys in both DataFrames. You can specify different types of joins using the `how` parameter (`'inner'`, `'outer'`, `'left'`, `'right'`).
  - The `on` parameter specifies the column or index level names to join on. If not specified and the DataFrames have overlapping column names, it will merge on those columns.
  - You can merge based on multiple columns by passing a list to the `on`, `left_on`, or `right_on` parameters.
  - If you want to merge based on the index of the DataFrames, you can set `left_index=True` and/or `right_index=True`.
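
The `indicator` and `suffixes` parameters are easiest to see on an outer join. The sketch below reuses the two frames from the example above; the suffix strings are arbitrary choices made for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})

# An outer join keeps unmatched keys from both sides; indicator=True adds a
# '_merge' column with the values 'left_only', 'right_only', or 'both'
audit = pd.merge(df1, df2, on='key', how='outer',
                 suffixes=('_left', '_right'), indicator=True)
print(audit)

# Rows whose key exists only in df1
left_only = audit[audit['_merge'] == 'left_only']
print(left_only)
```

Filtering on `_merge` is a quick way to find keys that failed to match before settling on a join type.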

In [290]:
users

Unnamed: 0,user_id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [291]:
msgs

Unnamed: 0,user_id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [293]:
pd.merge(users,msgs, how = 'inner', on = 'user_id')

Unnamed: 0,user_id,name,msg
0,1,sharadh,hmm
1,1,sharadh,acha
2,2,shahid,theek hai


In [294]:
users

Unnamed: 0,user_id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [295]:
msgs

Unnamed: 0,user_id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [296]:
pd.merge(users,msgs, how = 'left', on = 'user_id')

Unnamed: 0,user_id,name,msg
0,1,sharadh,hmm
1,1,sharadh,acha
2,2,shahid,theek hai
3,3,khusalli,


In [298]:
users

Unnamed: 0,user_id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [299]:
msgs

Unnamed: 0,user_id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [300]:
pd.merge(users,msgs,how = 'right', on = 'user_id')

Unnamed: 0,user_id,name,msg
0,1,sharadh,hmm
1,1,sharadh,acha
2,2,shahid,theek hai
3,4,,nice


In [301]:
users

Unnamed: 0,user_id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [302]:
msgs

Unnamed: 0,user_id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [152]:
pd.merge(users,msgs, how = 'outer', on = 'user_id')

Unnamed: 0,user_id,name,msg
0,1,sharadh,hmm
1,1,sharadh,acha
2,2,shahid,theek hai
3,3,khusalli,
4,4,,nice


### **pd.rename()**

The `df.rename()` method in Pandas is used to rename the columns or index labels of a DataFrame. Here's an explanation of its functionality:

- **Method**: `df.rename()`
- **Input**:
  - Parameters include `mapper`, `index`, `columns`, `axis`, `copy`, `inplace`, `level`, and `errors`.
- **Output**: Returns a new DataFrame with the specified columns or index labels renamed.
- **Usage**:
  - Typically used when you want to change the names of one or more columns or index labels in a DataFrame.
- **Parameters**:
  - `columns`: A mapping dictionary, Series, or function to rename the columns. It can be a dictionary where keys are the current column names and values are the new column names, or it can be a function that accepts a column name and returns a new name.
  - `index`: A mapping dictionary, Series, or function to rename the index. Similar to `columns`, but for index labels.
  - `mapper`: A mapping dictionary, Series, or function applied to the axis selected by `axis`. Use either `mapper` together with `axis`, or the `index`/`columns` keywords, but not both.
  - `inplace`: If set to True, modifies the DataFrame in place and returns None. Default is False.
  - `axis`: Specifies which axis `mapper` targets: index (axis=0) or columns (axis=1). When `mapper` is given without `axis`, the index is renamed.
  - `copy`: Controls whether the underlying data is also copied. It is rarely needed in practice.
  - `level`: For hierarchical (MultiIndex) columns or index, specifies the level(s) to rename.
  - `errors`: Specifies what happens when a label in the mapping does not exist. Options are `'ignore'` (the default, which silently skips missing labels) and `'raise'`.
- **Example**:
  ```python
  import pandas as pd

  # Create a DataFrame
  df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

  # Rename column 'A' to 'X'
  renamed_df = df.rename(columns={'A': 'X'})

  print(renamed_df)
  ```
  Output:
  ```
     X  B
  0  1  4
  1  2  5
  2  3  6
  ```
- **Note**:
  - By default, `df.rename()` returns a new DataFrame with the specified columns or index renamed, while the original DataFrame remains unchanged. If you want to modify the original DataFrame in place, you can use the `inplace=True` parameter.
  - You can specify renaming using a dictionary, a Series, or a function.
  - The `axis` parameter tells `mapper` whether to rename columns (axis=1) or index labels (axis=0).
  - If you want to rename both columns and index at the same time, pass both the `columns` and `index` parameters in one call.
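
A short sketch of the options above, assuming a made-up frame with string index labels (`frame`, `r1`, and `r2` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame with string row labels
frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r1', 'r2'])

# Rename a column and an index label in one call
both = frame.rename(columns={'A': 'X'}, index={'r1': 'row1'})

# A function is applied to every label on the chosen axis
lowered = frame.rename(columns=str.lower)

# mapper plus axis is an alternative spelling of the columns= keyword
via_mapper = frame.rename({'B': 'Y'}, axis='columns')

# errors='raise' turns a missing label into a KeyError instead of ignoring it
try:
    frame.rename(columns={'missing': 'Z'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(both)
print(lowered)
print(via_mapper)
print(raised)
```

None of these calls modify `frame`, because `inplace` is left at its default of False.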

In [304]:
users.rename(columns = {'user_id' : 'id'}, inplace = True)

In [154]:
users

Unnamed: 0,id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [305]:
users.rename(columns = {'id' : 'user_id'}, inplace = True)

In [306]:
users

Unnamed: 0,user_id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [307]:
users.rename(columns = {'user_id' : 'id'}, inplace = True)

In [308]:
users

Unnamed: 0,id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [309]:
msgs.rename(columns = {'user_id' : 'id'},inplace = True)

In [160]:
msgs

Unnamed: 0,id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [310]:
msgs.rename(columns = {'id' : 'user_id'}, inplace = True)

In [162]:
msgs

Unnamed: 0,user_id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [311]:
users

Unnamed: 0,id,name
0,1,sharadh
1,2,shahid
2,3,khusalli


In [312]:
msgs

Unnamed: 0,user_id,msg
0,1,hmm
1,1,acha
2,2,theek hai
3,4,nice


In [313]:
pd.merge(users,msgs, how = 'inner', on = 'user_id') # Raises KeyError: users no longer has a 'user_id' column (it was renamed to 'id'), while msgs still does

KeyError: 'user_id'

In [314]:
pd.merge(users,msgs, how = 'inner', left_on = 'id', right_on = 'user_id')

Unnamed: 0,id,name,user_id,msg
0,1,sharadh,1,hmm
1,1,sharadh,1,acha
2,2,shahid,2,theek hai


In [315]:
df1 = pd.DataFrame({'A' : [10,30], 'B' : [20,40], 'C' : [40,60]})
df2 = pd.DataFrame({'A' : [10,20], 'C' : [30,60]})

In [170]:
df1

Unnamed: 0,A,B,C
0,10,20,40
1,30,40,60


In [171]:
df2

Unnamed: 0,A,C
0,10,30
1,20,60


In [172]:
df1.merge(df2, how = 'inner', on = 'A')

Unnamed: 0,A,B,C_x,C_y
0,10,20,40,30


# IMDB data Analysis

In [316]:
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm

Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 58.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 113MB/s]


In [317]:
movies = pd.read_csv('movies.csv')

In [318]:
movies

Unnamed: 0.1,Unnamed: 0,id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day
0,0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday
1,1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday
2,2,43599,245000000,107,880674609,Spectre,6.3,4466,4764,2015,Oct,Monday
3,3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday
4,5,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday
...,...,...,...,...,...,...,...,...,...,...,...,...
1460,4736,48363,0,3,321952,The Last Waltz,7.9,64,4809,1978,May,Monday
1461,4743,48370,27000,19,3151130,Clerks,7.4,755,5369,1994,Sep,Tuesday
1462,4748,48375,0,7,0,Rampage,6.0,131,5148,2009,Aug,Friday
1463,4749,48376,0,3,0,Slacker,6.4,77,5535,1990,Jul,Friday


In [319]:
movies.shape

(1465, 12)

In [320]:
len(movies.index)

1465

In [321]:
directors = pd.read_csv('directors.csv')
directors

Unnamed: 0.1,Unnamed: 0,director_name,id,gender
0,0,James Cameron,4762,Male
1,1,Gore Verbinski,4763,Male
2,2,Sam Mendes,4764,Male
3,3,Christopher Nolan,4765,Male
4,4,Andrew Stanton,4766,Male
...,...,...,...,...
2344,2344,Shane Carruth,7106,Male
2345,2345,Neill Dela Llana,7107,
2346,2346,Scott Smith,7108,
2347,2347,Daniel Hsia,7109,Male


In [322]:
directors.shape

(2349, 4)

In [323]:
len(directors.index)

2349

In [324]:
movies.head()

Unnamed: 0.1,Unnamed: 0,id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day
0,0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday
1,1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday
2,2,43599,245000000,107,880674609,Spectre,6.3,4466,4764,2015,Oct,Monday
3,3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday
4,5,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday


In [330]:
directors.head()

Unnamed: 0.1,Unnamed: 0,director_name,id,gender
0,0,James Cameron,4762,Male
1,1,Gore Verbinski,4763,Male
2,2,Sam Mendes,4764,Male
3,3,Christopher Nolan,4765,Male
4,4,Andrew Stanton,4766,Male


In [331]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1465 entries, 0 to 1464
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    1465 non-null   int64  
 1   id            1465 non-null   int64  
 2   budget        1465 non-null   int64  
 3   popularity    1465 non-null   int64  
 4   revenue       1465 non-null   int64  
 5   title         1465 non-null   object 
 6   vote_average  1465 non-null   float64
 7   vote_count    1465 non-null   int64  
 8   director_id   1465 non-null   int64  
 9   year          1465 non-null   int64  
 10  month         1465 non-null   object 
 11  day           1465 non-null   object 
dtypes: float64(1), int64(8), object(3)
memory usage: 137.5+ KB


In [332]:
movies.drop(columns = 'Unnamed: 0', inplace = True)

In [343]:
movies.rename(columns = {'id' : 'movie_id'}, inplace = True)
movies

Unnamed: 0,movie_id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day
0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday
2,43599,245000000,107,880674609,Spectre,6.3,4466,4764,2015,Oct,Monday
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday
...,...,...,...,...,...,...,...,...,...,...,...
1460,48363,0,3,321952,The Last Waltz,7.9,64,4809,1978,May,Monday
1461,48370,27000,19,3151130,Clerks,7.4,755,5369,1994,Sep,Tuesday
1462,48375,0,7,0,Rampage,6.0,131,5148,2009,Aug,Friday
1463,48376,0,3,0,Slacker,6.4,77,5535,1990,Jul,Friday


In [338]:
directors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2349 entries, 0 to 2348
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   director_name  2349 non-null   object
 1   id             2349 non-null   int64 
 2   gender         1724 non-null   object
dtypes: int64(1), object(2)
memory usage: 55.2+ KB


In [334]:
directors.drop(columns = 'Unnamed: 0', inplace = True)

In [341]:
pd.merge(movies, directors, how = 'inner', left_on = 'director_id', right_on = 'id')

Unnamed: 0,movie_id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day,director_name,id,gender
0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday,James Cameron,4762,Male
1,43622,200000000,100,1845034188,Titanic,7.5,7562,4762,1997,Nov,Tuesday,James Cameron,4762,Male
2,43876,100000000,101,520000000,Terminator 2: Judgment Day,7.7,4185,4762,1991,Jul,Monday,James Cameron,4762,Male
3,43879,115000000,38,378882411,True Lies,6.8,1116,4762,1994,Jul,Thursday,James Cameron,4762,Male
4,44184,70000000,24,90000098,The Abyss,7.1,808,4762,1989,Aug,Wednesday,James Cameron,4762,Male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1460,46859,0,14,25288872,Enough Said,6.6,348,6204,2013,Sep,Wednesday,Nicole Holofcener,6204,Female
1461,47023,6500000,11,13368437,Friends with Money,5.1,128,6204,2006,Sep,Thursday,Nicole Holofcener,6204,Female
1462,47524,3000000,5,0,Please Give,6.0,57,6204,2010,Jan,Friday,Nicole Holofcener,6204,Female
1463,47962,0,0,0,Walking and Talking,6.6,7,6204,1996,Jul,Wednesday,Nicole Holofcener,6204,Female


In [325]:
import pandas as pd

data = {
    'name': ['Elon', 'Jeff', 'Bill', 'Falguni'],
    'gender': ['M', 'F', 'M', 'F'],
    'income': [53000, 28000, 25000, 44000]
}

df = pd.DataFrame(data)

In [344]:
df

Unnamed: 0,name,gender,income
0,Elon,M,53000
1,Jeff,F,28000
2,Bill,M,25000
3,Falguni,F,44000


In [345]:
import pandas as pd

def max_income(df):
    '''
    INPUT: df -> DataFrame with 'name' and 'income' columns

    OUTPUT: result -> String, the name of the person with the highest income

    '''

    ### STEP 1: Get the maximum income

    top_income = df['income'].max()
    print(top_income)

    ### STEP 2: Filter the rows with that income and extract the name as a string

    result = df.loc[df['income'] == top_income, 'name'].values[0]

    return result
max_income(df)


53000


'Elon'

In [346]:
import pandas as pd

In [347]:
!gdown '1Dtm4ZlXPqcwi8T98acWdyQueU0S8UTWj'

Downloading...
From: https://drive.google.com/uc?id=1Dtm4ZlXPqcwi8T98acWdyQueU0S8UTWj
To: /content/titanic.csv
  0% 0.00/767 [00:00<?, ?B/s]100% 767/767 [00:00<00:00, 2.40MB/s]


In [348]:
titanic = pd.read_csv('titanic.csv')

In [349]:
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,embark_town,pclass_1,price,gender
0,1,3,female,26.0,0,0,7.925,S,Third,False,Southampton,3,7.925,female
1,1,1,female,38.0,1,0,71.2833,C,First,False,Cherbourg,1,71.2833,female
2,1,3,female,26.0,0,0,7.925,S,Third,False,Southampton,3,7.925,female
3,0,3,male,,0,0,8.4583,Q,Third,True,Queenstown,3,8.4583,male
4,1,2,female,14.0,1,0,30.0708,C,Second,False,Cherbourg,2,30.0708,female
5,0,3,male,,0,0,8.4583,Q,Third,True,Queenstown,3,8.4583,male
6,0,1,male,54.0,0,0,51.8625,S,First,True,Southampton,1,51.8625,male
7,1,1,female,38.0,1,0,71.2833,C,First,False,Cherbourg,1,71.2833,female
8,1,3,female,27.0,0,2,11.1333,S,Third,False,Southampton,3,11.1333,female
9,1,2,female,14.0,1,0,30.0708,C,Second,False,Cherbourg,2,30.0708,female


In [350]:
titanic.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
survived,1,1,1,0,1,0,0,1,1,1
pclass,3,1,3,3,2,3,1,1,3,2
sex,female,female,female,male,female,male,male,female,female,female
age,26.0,38.0,26.0,,14.0,,54.0,38.0,27.0,14.0
sibsp,0,1,0,0,1,0,0,1,0,1
parch,0,0,0,0,0,0,0,0,2,0
fare,7.925,71.2833,7.925,8.4583,30.0708,8.4583,51.8625,71.2833,11.1333,30.0708
embarked,S,C,S,Q,C,Q,S,C,S,C
class,Third,First,Third,Third,Second,Third,First,First,Third,Second
adult_male,False,False,False,True,False,True,True,False,False,False


In [351]:
'''
drop_duplicates() identifies and removes duplicate rows from a DataFrame.
The keep parameter determines which occurrence of each set of duplicated rows is kept:
"first": Keep the first occurrence of each duplicated row (the default).
"last": Keep the last occurrence of each duplicated row.
False: Drop every row that has a duplicate.
Applied to the transpose titanic.T, the rows being compared are the columns of titanic,
so this call removes columns that duplicate an earlier column.
'''

titanic.T.drop_duplicates(keep = "first")

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
survived,1,1,1,0,1,0,0,1,1,1
pclass,3,1,3,3,2,3,1,1,3,2
sex,female,female,female,male,female,male,male,female,female,female
age,26.0,38.0,26.0,,14.0,,54.0,38.0,27.0,14.0
sibsp,0,1,0,0,1,0,0,1,0,1
parch,0,0,0,0,0,0,0,0,2,0
fare,7.925,71.2833,7.925,8.4583,30.0708,8.4583,51.8625,71.2833,11.1333,30.0708
embarked,S,C,S,Q,C,Q,S,C,S,C
class,Third,First,Third,Third,Second,Third,First,First,Third,Second
adult_male,False,False,False,True,False,True,True,False,False,False


In [201]:
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,embark_town,pclass_1,price,gender
0,1,3,female,26.0,0,0,7.925,S,Third,False,Southampton,3,7.925,female
1,1,1,female,38.0,1,0,71.2833,C,First,False,Cherbourg,1,71.2833,female
2,1,3,female,26.0,0,0,7.925,S,Third,False,Southampton,3,7.925,female
3,0,3,male,,0,0,8.4583,Q,Third,True,Queenstown,3,8.4583,male
4,1,2,female,14.0,1,0,30.0708,C,Second,False,Cherbourg,2,30.0708,female
5,0,3,male,,0,0,8.4583,Q,Third,True,Queenstown,3,8.4583,male
6,0,1,male,54.0,0,0,51.8625,S,First,True,Southampton,1,51.8625,male
7,1,1,female,38.0,1,0,71.2833,C,First,False,Cherbourg,1,71.2833,female
8,1,3,female,27.0,0,2,11.1333,S,Third,False,Southampton,3,11.1333,female
9,1,2,female,14.0,1,0,30.0708,C,Second,False,Cherbourg,2,30.0708,female


In [202]:
titanic.drop_duplicates(keep = 'first')

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,embark_town,pclass_1,price,gender
0,1,3,female,26.0,0,0,7.925,S,Third,False,Southampton,3,7.925,female
1,1,1,female,38.0,1,0,71.2833,C,First,False,Cherbourg,1,71.2833,female
3,0,3,male,,0,0,8.4583,Q,Third,True,Queenstown,3,8.4583,male
4,1,2,female,14.0,1,0,30.0708,C,Second,False,Cherbourg,2,30.0708,female
6,0,1,male,54.0,0,0,51.8625,S,First,True,Southampton,1,51.8625,male
8,1,3,female,27.0,0,2,11.1333,S,Third,False,Southampton,3,11.1333,female


In [203]:
titanic.drop_duplicates(keep = 'last')

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,embark_town,pclass_1,price,gender
2,1,3,female,26.0,0,0,7.925,S,Third,False,Southampton,3,7.925,female
5,0,3,male,,0,0,8.4583,Q,Third,True,Queenstown,3,8.4583,male
6,0,1,male,54.0,0,0,51.8625,S,First,True,Southampton,1,51.8625,male
7,1,1,female,38.0,1,0,71.2833,C,First,False,Cherbourg,1,71.2833,female
8,1,3,female,27.0,0,2,11.1333,S,Third,False,Southampton,3,11.1333,female
9,1,2,female,14.0,1,0,30.0708,C,Second,False,Cherbourg,2,30.0708,female


In [204]:
import pandas as pd

data = {
    'cust_id': [101, 102, 103, 104],
    'name': ['rick', 'morty', 'pickle', 'jerry']
}

customer = pd.DataFrame(data)

In [205]:
customer

Unnamed: 0,cust_id,name
0,101,rick
1,102,morty
2,103,pickle
3,104,jerry


In [206]:
import pandas as pd

data = {
    'order_id': ['OR1', 'OR3', 'OR23', 'OR42'],
    'cust_id': [102, 105, 101, 102],
    'amount': [1200, 650, 120, 989]
}

orders = pd.DataFrame(data)

In [207]:
orders

Unnamed: 0,order_id,cust_id,amount
0,OR1,102,1200
1,OR3,105,650
2,OR23,101,120
3,OR42,102,989


In [208]:
merged_df = customer.merge(orders, how = 'left', on = 'cust_id')

In [209]:
merged_df

Unnamed: 0,cust_id,name,order_id,amount
0,101,rick,OR23,120.0
1,102,morty,OR1,1200.0
2,102,morty,OR42,989.0
3,103,pickle,,
4,104,jerry,,


In [210]:
name = 'morty'
merged_df[merged_df.name == name]["amount"].sum()

2189.0
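
Filtering by one name gives the total for that customer only. As a sketch of a more general approach, `groupby` computes the total for every customer at once; the frames are rebuilt here so the block runs on its own:

```python
import pandas as pd

customer = pd.DataFrame({
    'cust_id': [101, 102, 103, 104],
    'name': ['rick', 'morty', 'pickle', 'jerry'],
})
orders = pd.DataFrame({
    'order_id': ['OR1', 'OR3', 'OR23', 'OR42'],
    'cust_id': [102, 105, 101, 102],
    'amount': [1200, 650, 120, 989],
})

merged_df = customer.merge(orders, how='left', on='cust_id')

# One total per customer; sum() skips NaN, so customers without
# orders get 0.0 rather than NaN
totals = merged_df.groupby('name')['amount'].sum()
print(totals)
```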

In [211]:
df = pd.DataFrame({'name' : ['a','b','c'], 'age' : [12,15,16]})

In [212]:
df

Unnamed: 0,name,age
0,a,12
1,b,15
2,c,16


In [218]:
data = [['d', 20], ['e', 21], ['f', 22]]
data

[['d', 20], ['e', 21], ['f', 22]]

In [219]:
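df.append(data) # DataFrame.append is deprecated (removed in pandas 2.0); the plain list is treated as new columns 0 and 1 rather than as rows of name/age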


Unnamed: 0,name,age,0,1
0,a,12.0,,
1,b,15.0,,
2,c,16.0,,
0,,,d,20.0
1,,,e,21.0
2,,,f,22.0


In [217]:
pd.concat(df,data) # Raises TypeError: pd.concat expects its objects as a single list (e.g. [df, other_df]), and data is a plain Python list that must first be converted to a DataFrame

  pd.concat(df,data)


TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
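
One way to append the rows correctly, sketched under the assumption that the new rows use the same column order as `df`: wrap the list in a DataFrame, then pass both frames to `pd.concat()` inside a list.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'age': [12, 15, 16]})
data = [['d', 20], ['e', 21], ['f', 22]]

# Give the new rows the same column labels as df so the columns line up
new_rows = pd.DataFrame(data, columns=df.columns)

# pd.concat takes a list of pandas objects as its first argument
combined = pd.concat([df, new_rows], ignore_index=True)
print(combined)
```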