# Module: Pandas Assignments
## Lesson: Pandas
### Assignment 1: DataFrame Creation and Indexing

1. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Set the index to be the first column.
2. Create a Pandas DataFrame with columns 'A', 'B', 'C' and index 'X', 'Y', 'Z'. Fill the DataFrame with random integers and access the element at row 'Y' and column 'B'.

### Assignment 2: DataFrame Operations

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Add a new column that is the product of the first two columns.
2. Create a Pandas DataFrame with 3 columns and 4 rows filled with random integers. Compute the row-wise and column-wise sum.

### Assignment 3: Data Cleaning

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Introduce some NaN values. Fill the NaN values with the mean of the respective columns.
2. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Introduce some NaN values. Drop the rows with any NaN values.

### Assignment 4: Data Aggregation

1. Create a Pandas DataFrame with 2 columns: 'Category' and 'Value'. Fill the 'Category' column with random categories ('A', 'B', 'C') and the 'Value' column with random integers. Group the DataFrame by 'Category' and compute the sum and mean of 'Value' for each category.
2. Create a Pandas DataFrame with 3 columns: 'Product', 'Category', and 'Sales'. Fill the DataFrame with random data. Group the DataFrame by 'Category' and compute the total sales for each category.

### Assignment 5: Merging DataFrames

1. Create two Pandas DataFrames with a common column. Merge the DataFrames using the common column.
2. Create two Pandas DataFrames with different columns. Concatenate the DataFrames along the rows and along the columns.

### Assignment 6: Time Series Analysis

1. Create a Pandas DataFrame with a datetime index and one column filled with random integers. Resample the DataFrame to compute the monthly mean of the values.
2. Create a Pandas DataFrame with a datetime index ranging from '2021-01-01' to '2021-12-31' and one column filled with random integers. Compute the rolling mean with a window of 7 days.

### Assignment 7: MultiIndex DataFrame

1. Create a Pandas DataFrame with a MultiIndex (hierarchical index). Perform some basic indexing and slicing operations on the MultiIndex DataFrame.
2. Create a Pandas DataFrame with MultiIndex consisting of 'Category' and 'SubCategory'. Fill the DataFrame with random data and compute the sum of values for each 'Category' and 'SubCategory'.

### Assignment 8: Pivot Tables

1. Create a Pandas DataFrame with columns 'Date', 'Category', and 'Value'. Create a pivot table to compute the sum of 'Value' for each 'Category' by 'Date'.
2. Create a Pandas DataFrame with columns 'Year', 'Quarter', and 'Revenue'. Create a pivot table to compute the mean 'Revenue' for each 'Quarter' by 'Year'.

### Assignment 9: Applying Functions

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Apply a function that doubles the values of the DataFrame.
2. Create a Pandas DataFrame with 3 columns and 6 rows filled with random integers. Apply a lambda function to create a new column that is the sum of the existing columns.

### Assignment 10: Working with Text Data

1. Create a Pandas Series with 5 random text strings. Convert all the strings to uppercase.
2. Create a Pandas Series with 5 random text strings. Extract the first three characters of each string.


In [1]:
### solution 1.1

#importing libraries
import pandas as pd
import numpy as np

data = np.random.randint(1,101, size=(4,6))

#pandas dataframe
df = pd.DataFrame(data, columns=['A','B', 'C', 'D','E','F'])
print("Dataframe: \n", df)

#set index to first column
df.set_index('A', inplace=True)
print("Dataframe with index to first column: \n", df)

Dataframe: 
     A   B   C   D   E   F
0  56  48  42  30  77  12
1  59  27  71   8  23  15
2  46   7  34  55  67  75
3  99  74  33  99  98  15
Dataframe with index to first column: 
      B   C   D   E   F
A                     
56  48  42  30  77  12
59  27  71   8  23  15
46   7  34  55  67  75
99  74  33  99  98  15


In [9]:
### solution 1.2

df = pd.DataFrame(np.random.randint(1,16,size=(3,3)), columns=['A','B','C'], index=['X','Y','Z'])
print("Datafram: \n", df)

#elements at row 'Y' and column 'B'
print("elements at row 'Y' and column 'B': \n", df.at['Y','B'])

Datafram: 
     A   B  C
X  10   4  1
Y  13  10  6
Z   6   6  9
elements at row 'Y' and column 'B': 
 10


In [4]:
### solution 2.1

#dataframe with 3 cols and 5 rows
df = pd.DataFrame(np.random.randint(1,51,size=(5,3)), columns=['A','B','C'])
print("Dataframe: \n", df)

#new column with product of first two column
df['Prod'] = df['A']*df['B']
print("Modified df: \n", df)

Dataframe: 
     A   B   C
0   7  41  50
1  30  34   9
2  21  13  41
3  15  15  28
4  15  30  22
Modified df: 
     A   B   C  Prod
0   7  41  50   287
1  30  34   9  1020
2  21  13  41   273
3  15  15  28   225
4  15  30  22   450


In [5]:
### solution 2.2

# DF with 3 cols and 4 rows
df = pd.DataFrame(np.random.randint(1,61,size=(4,3)), columns=['A','B','C'])
print("Dataframe: \n", df)

# row wise sum
row_sum = df.sum(axis=1)

#colum wise sum
col_sum = df.sum(axis=0)

print("Row-wise sum of df: \n", row_sum)

print("Column-wise sum of df: \n", col_sum)

Dataframe: 
     A   B   C
0  47  16  12
1  58  33  41
2  36  33  59
3  11  40  18
Row-wise sum of df: 
 0     75
1    132
2    128
3     69
dtype: int64
Column-wise sum of df: 
 A    152
B    122
C    130
dtype: int64


In [15]:
### solution 3.1

#DF with 3 cols and 5 rows
df = pd.DataFrame(np.random.randint(1,51,size=(5,3)), columns=["A","B","C"])
print("Dataframe: \n", df)

#introduce some NaN values
df.iloc[1,0] = np.nan
df.iloc[4,1] = np.nan
df.iloc[0,2] = np.nan

print("Dataframe with NAN: \n", df)

#replace na values with column mean
df.fillna(df.mean(), inplace=True)

print("Cleaned DF: \n", df)

Dataframe: 
     A   B   C
0  15  18  49
1  50  24  30
2  42  49  40
3  48   4   3
4  10   1  36
Dataframe with NAN: 
       A     B     C
0  15.0  18.0   NaN
1   NaN  24.0  30.0
2  42.0  49.0  40.0
3  48.0   4.0   3.0
4  10.0   NaN  36.0
Cleaned DF: 
        A      B      C
0  15.00  18.00  27.25
1  28.75  24.00  30.00
2  42.00  49.00  40.00
3  48.00   4.00   3.00
4  10.00  23.75  36.00


In [18]:
### solution 3.2

#dataframe with 4 cols and 6 rows
df = pd.DataFrame(np.random.randint(1,21,size=(6,4)), columns=['A','B','C','D'])
print('Dataframe: \n', df)

#introduce nan values
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan

print("Dataframe with NAN: \n", df)

#drop rows with NaN
df.dropna(axis=0, inplace=True)

print("Cleaned DF: \n", df)

Dataframe: 
     A   B   C   D
0  15  12  13  19
1   1  12   7  16
2  19  19   1  13
3  11   8   5  11
4  20  13  20   5
5   2  17  12   2
Dataframe with NAN: 
     A     B     C   D
0  15   NaN  13.0  19
1   1  12.0   NaN  16
2  19  19.0   1.0  13
3  11   8.0   5.0  11
4  20  13.0  20.0   5
5   2  17.0  12.0   2
Cleaned DF: 
     A     B     C   D
2  19  19.0   1.0  13
3  11   8.0   5.0  11
4  20  13.0  20.0   5
5   2  17.0  12.0   2


In [25]:
### solution 4.1

df = pd.DataFrame({'Category': np.random.choice(['A','B','C'],size=10), 'Value':np.random.randint(1,100, size=10)})
print("Dataframe: \n",df)

#group by 'Category' to compute sum and mean of value
print("Sum of value of each category: \n", df.groupby('Category')['Value'].sum())

print("Mean of value of each category: \n", df.groupby('Category')['Value'].mean())

Dataframe: 
   Category  Value
0        B     14
1        B     90
2        C     35
3        C     95
4        A     93
5        A     88
6        C     18
7        C     60
8        B     68
9        C     21
Sum of value of each category: 
 Category
A    181
B    172
C    229
Name: Value, dtype: int32
Mean of value of each category: 
 Category
A    90.500000
B    57.333333
C    45.800000
Name: Value, dtype: float64


In [30]:
### solution 4.2

#Dataframe with 3 columns
df = pd.DataFrame({'Product': np.random.choice(['A','B','C'],size=10), 'Category':np.random.choice(['X','Y','Z'],size=10),'Sales':np.random.randint(55,79,size=10)})
print("Dataframe: \n", df)

#Total sales by category
print("Total sales by category: \n")
df.groupby('Category')['Sales'].sum()

Dataframe: 
   Product Category  Sales
0       B        Y     61
1       C        X     78
2       A        X     76
3       C        X     62
4       C        Y     76
5       A        X     67
6       C        X     62
7       B        X     67
8       A        Y     56
9       A        X     70
Total sales by category: 



Category
X    482
Y    193
Name: Sales, dtype: int32

In [37]:
### solution 5.1

#two dfs with common columns
df1 = pd.DataFrame({'Key':['A','B','C'], 'V1':np.random.randint(1,11, size=3), 'V2':np.random.randint(1,11,size=3)})
df2 = pd.DataFrame({'Key':['A','B','C'], 'V1':np.random.randint(1,11, size=3), 'V2':np.random.randint(1,11,size=3)})
print("Dataframe 1: \n", df1)
print("Dataframe 2 \n", df2)

#merge two dataframes
merged_df = pd.merge(df1, df2, on='Key')
print('Merged Dataframe: \n', merged_df)

Dataframe 1: 
   Key  V1  V2
0   A   6   8
1   B  10   9
2   C   5   6
Dataframe 2 
   Key  V1  V2
0   A   6   2
1   B   2  10
2   C   3   3
Merged Dataframe: 
   Key  V1_x  V2_x  V1_y  V2_y
0   A     6     8     6     2
1   B    10     9     2    10
2   C     5     6     3     3


In [40]:
### solution 5.2

#two dfs with different columns
df1 = pd.DataFrame({'A':np.random.randint(1,101,size=5), 'B':np.random.randint(1,101,size=5)})
df2 = pd.DataFrame({'C':np.random.randint(1,101,size=5), 'D':np.random.randint(1,101,size=5)})

print("DF 1: \n", df1)
print("DF 2: \n", df2)

#concatenating dataframes
concat_df_1 = pd.concat([df1, df2], axis=0)
print("concatenated DF along the rows: \n", concat_df_1)

concat_df_2 = pd.concat([df1, df2], axis=1)
print("concatenated DF along the columns: \n", concat_df_2)

DF 1: 
     A   B
0  91  45
1  96  68
2  55  51
3   4  51
4   6  61
DF 2: 
     C   D
0  29  68
1  72  55
2  37  62
3  44  70
4  89  37
concatenated DF along the rows: 
       A     B     C     D
0  91.0  45.0   NaN   NaN
1  96.0  68.0   NaN   NaN
2  55.0  51.0   NaN   NaN
3   4.0  51.0   NaN   NaN
4   6.0  61.0   NaN   NaN
0   NaN   NaN  29.0  68.0
1   NaN   NaN  72.0  55.0
2   NaN   NaN  37.0  62.0
3   NaN   NaN  44.0  70.0
4   NaN   NaN  89.0  37.0
concatenated DF along the columns: 
     A   B   C   D
0  91  45  29  68
1  96  68  72  55
2  55  51  37  62
3   4  51  44  70
4   6  61  89  37


In [43]:
### solution 6.1

date = pd.date_range(start='2025-01-01', end='2025-12-31', freq='D')
df = pd.DataFrame(date, columns=['date'])
df['data']=np.random.randint(0,101,size=len(date))
df.set_index('date', inplace=True)

print("Dataframe: \n", df.head())

monthly_mean = df.resample('M').mean()
print('Monthly mean df:')
print(monthly_mean)


Dataframe: 
             data
date            
2025-01-01    55
2025-01-02    51
2025-01-03    90
2025-01-04    86
2025-01-05    72


  monthly_mean = df.resample('M').mean()


Monthly mean df:
                 data
date                 
2025-01-31  55.806452
2025-02-28  60.750000
2025-03-31  49.032258
2025-04-30  45.600000
2025-05-31  47.870968
2025-06-30  47.866667
2025-07-31  56.677419
2025-08-31  55.709677
2025-09-30  36.500000
2025-10-31  57.580645
2025-11-30  45.166667
2025-12-31  47.580645


In [46]:
### solution 6.2

date = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
df = pd.DataFrame(date, columns=['date'])
df['data'] = np.random.randint(0,100,size=len(date))
df.set_index('date', inplace=True)

print("Dataframe: \n", df.head())

# rolling mean
rollingmean = df.rolling(window=7).mean()
print("Rollng mean dataframe: \n", rollingmean.head(10))

Dataframe: 
             data
date            
2021-01-01    42
2021-01-02    76
2021-01-03    12
2021-01-04    29
2021-01-05    94
Rollng mean dataframe: 
                  data
date                 
2021-01-01        NaN
2021-01-02        NaN
2021-01-03        NaN
2021-01-04        NaN
2021-01-05        NaN
2021-01-06        NaN
2021-01-07  57.000000
2021-01-08  60.571429
2021-01-09  56.000000
2021-01-10  54.428571


In [49]:
### solution 7.1

#Pandas Dataframe with a multindex
arrays = [['A','A','B','B'],['one','two','one','two']]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'Subcategory'))
df = pd.DataFrame(np.random.randint(1,100,size=(4,3)), index=index, columns=['Val1','Val2','Val3'])
print("Multiindex DF: \n", df)

#Basic indexing and slicing
print("Indexing at Category A: ", df.loc['A'])

print("Slicing at Category B and subcategory 'one': \n", df.loc[('B','one')])

Multiindex DF: 
                       Val1  Val2  Val3
Category Subcategory                  
A        one            11    25    77
         two            55    96    81
B        one            76    65    55
         two            74    41    71
Indexing at Category A:               Val1  Val2  Val3
Subcategory                  
one            11    25    77
two            55    96    81
Slicing at Category B and subcategory 'one': 
 Val1    76
Val2    65
Val3    55
Name: (B, one), dtype: int32


In [52]:
### solution 7.2

arrays = [['A','A','B','B','C','C'], ['one','two','three','one','two','three']]
index = pd.MultiIndex.from_arrays(arrays, names=('Category','Subcategory'))
df = pd.DataFrame(np.random.randint(1,51,size=(6,3)), index=index,columns = ['val1','val2','val3'])
print('multiindex DF: \n', df)

sum_vals = df.groupby(['Category','Subcategory']).sum()
print("sum of values: \n", sum_vals)

multiindex DF: 
                       val1  val2  val3
Category Subcategory                  
A        one            42    33    16
         two            32    44    18
B        three          37    32     8
         one            42    35    20
C        two            40    20    31
         three           5    45    13
sum of values: 
                       val1  val2  val3
Category Subcategory                  
A        one            42    33    16
         two            32    44    18
B        one            42    35    20
         three          37    32     8
C        three           5    45    13
         two            40    20    31


In [60]:
### solution 8.1

date = pd.date_range(start='2025-01-01', end='2025-01-31',freq='D')
df = pd.DataFrame({'Date':np.random.choice(date,size=100), 'Category':np.random.choice(('A','B','C'), 100), 'Value': np.random.randint(75,99,size=100)})
print("Dataframe: \n", df.head())

#create a pivot table to compute the sum of 'value' for each 'category'
piv_tab = df.pivot_table(values='Value', index='Date', columns='Category', aggfunc='sum')
print("Pivot table: \n", piv_tab)

Dataframe: 
         Date Category  Value
0 2025-01-16        B     88
1 2025-01-18        C     98
2 2025-01-07        B     86
3 2025-01-02        C     78
4 2025-01-21        A     92
Pivot table: 
 Category        A      B      C
Date                           
2025-01-01   90.0   87.0    NaN
2025-01-02    NaN   82.0   78.0
2025-01-03   98.0    NaN   94.0
2025-01-04  273.0   94.0  194.0
2025-01-05    NaN  183.0  344.0
2025-01-06  173.0    NaN    NaN
2025-01-07    NaN   86.0   75.0
2025-01-08  183.0    NaN   92.0
2025-01-09   75.0   91.0   76.0
2025-01-10  180.0    NaN    NaN
2025-01-11   90.0  257.0   76.0
2025-01-12  172.0   82.0    NaN
2025-01-13  183.0    NaN   90.0
2025-01-14  250.0    NaN   83.0
2025-01-15    NaN   96.0    NaN
2025-01-16   83.0   88.0   96.0
2025-01-17   80.0    NaN   88.0
2025-01-18  156.0  257.0  175.0
2025-01-19   88.0   97.0    NaN
2025-01-20  256.0   78.0    NaN
2025-01-21  265.0    NaN  260.0
2025-01-22    NaN   96.0   91.0
2025-01-23    NaN   78.0   76.

In [58]:
### solution 8.2

df = pd.DataFrame({'Year':np.random.choice(([2022,2023,2024]), size=20), 'Quarter':np.random.choice(('Q1','Q2','Q3'),size=20), 'Revenue':np.random.randint(800,850,size=20)})
print("Dataframe: \n", df)

#create a pivot table to compute the mean 'Revenue' for each 'Quarter' by 'Year'
piv_tab = df.pivot_table(values='Revenue', index='Year', columns='Quarter', aggfunc='mean')
print('pivot table: \n', piv_tab)

Dataframe: 
     Year Quarter  Revenue
0   2022      Q3      815
1   2022      Q1      848
2   2023      Q2      831
3   2023      Q3      823
4   2022      Q3      805
5   2022      Q3      803
6   2023      Q1      835
7   2022      Q3      836
8   2022      Q1      843
9   2024      Q1      819
10  2022      Q1      848
11  2024      Q3      817
12  2022      Q2      807
13  2022      Q1      811
14  2022      Q1      811
15  2024      Q2      800
16  2024      Q3      807
17  2024      Q3      846
18  2022      Q1      813
19  2022      Q1      841
pivot table: 
 Quarter          Q1     Q2          Q3
Year                                  
2022     830.714286  807.0  814.750000
2023     835.000000  831.0  823.000000
2024     819.000000  800.0  823.333333


In [62]:
### solution 9.1

df = pd.DataFrame(np.random.randint(51,61,size=(5,3)), columns=['A','B','C'])
print("Dataframe: \n", df)

df_double = df.applymap(lambda x:x*2)
print("Modified DF: \n", df_double)

Dataframe: 
     A   B   C
0  60  60  57
1  57  53  59
2  56  57  58
3  60  57  57
4  53  57  58
Modified DF: 
      A    B    C
0  120  120  114
1  114  106  118
2  112  114  116
3  120  114  114
4  106  114  116


  df_double = df.applymap(lambda x:x*2)


In [66]:
### solution 9.2

df = pd.DataFrame(np.random.randint(51,61,size=(6,3)), columns=['A','B','C'])
print("Dataframe: \n", df)

df['Sum'] = df.apply(lambda row: row.sum(), axis=1)
print("Modified Df: \n", df)

Dataframe: 
     A   B   C
0  57  59  58
1  53  51  53
2  56  54  55
3  56  56  58
4  51  59  54
5  57  58  54
Modified Df: 
     A   B   C  Sum
0  57  59  58  174
1  53  51  53  157
2  56  54  55  165
3  56  56  58  170
4  51  59  54  164
5  57  58  54  169


In [67]:
### solution 10.1

data = pd.Series(['Apple','Box','carrot','dog','high'])
print('Series: \n', data)

data_upper = data.str.upper()
print("Uppercase series: \n", data_upper)

Series: 
 0     Apple
1       Box
2    carrot
3       dog
4      high
dtype: object
Uppercase series: 
 0     APPLE
1       BOX
2    CARROT
3       DOG
4      HIGH
dtype: object


In [73]:
### solution 10.2

data = pd.Series(['Apple','Box','carrot','dog','high'])
print('Series: \n', data)

first_3 = data.str[:3]
print("first 3 chars Series: \n", first_3)

Series: 
 0     Apple
1       Box
2    carrot
3       dog
4      high
dtype: object
first 3 chars Series: 
 0    App
1    Box
2    car
3    dog
4    hig
dtype: object
