# Module: Pandas Assignments
## Lesson: Pandas
### Assignment 1: DataFrame Creation and Indexing

1. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Set the index to be the first column.
2. Create a Pandas DataFrame with columns 'A', 'B', 'C' and index 'X', 'Y', 'Z'. Fill the DataFrame with random integers and access the element at row 'Y' and column 'B'.

### Assignment 2: DataFrame Operations

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Add a new column that is the product of the first two columns.
2. Create a Pandas DataFrame with 3 columns and 4 rows filled with random integers. Compute the row-wise and column-wise sum.

### Assignment 3: Data Cleaning

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Introduce some NaN values. Fill the NaN values with the mean of the respective columns.
2. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Introduce some NaN values. Drop the rows with any NaN values.

### Assignment 4: Data Aggregation

1. Create a Pandas DataFrame with 2 columns: 'Category' and 'Value'. Fill the 'Category' column with random categories ('A', 'B', 'C') and the 'Value' column with random integers. Group the DataFrame by 'Category' and compute the sum and mean of 'Value' for each category.
2. Create a Pandas DataFrame with 3 columns: 'Product', 'Category', and 'Sales'. Fill the DataFrame with random data. Group the DataFrame by 'Category' and compute the total sales for each category.

### Assignment 5: Merging DataFrames

1. Create two Pandas DataFrames with a common column. Merge the DataFrames using the common column.
2. Create two Pandas DataFrames with different columns. Concatenate the DataFrames along the rows and along the columns.

### Assignment 6: Time Series Analysis

1. Create a Pandas DataFrame with a datetime index and one column filled with random integers. Resample the DataFrame to compute the monthly mean of the values.
2. Create a Pandas DataFrame with a datetime index ranging from '2021-01-01' to '2021-12-31' and one column filled with random integers. Compute the rolling mean with a window of 7 days.

### Assignment 7: MultiIndex DataFrame

1. Create a Pandas DataFrame with a MultiIndex (hierarchical index). Perform some basic indexing and slicing operations on the MultiIndex DataFrame.
2. Create a Pandas DataFrame with a MultiIndex consisting of 'Category' and 'SubCategory'. Fill the DataFrame with random data and compute the sum of values for each 'Category' and 'SubCategory'.

### Assignment 8: Pivot Tables

1. Create a Pandas DataFrame with columns 'Date', 'Category', and 'Value'. Create a pivot table to compute the sum of 'Value' for each 'Category' by 'Date'.
2. Create a Pandas DataFrame with columns 'Year', 'Quarter', and 'Revenue'. Create a pivot table to compute the mean 'Revenue' for each 'Quarter' by 'Year'.

### Assignment 9: Applying Functions

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Apply a function that doubles the values of the DataFrame.
2. Create a Pandas DataFrame with 3 columns and 6 rows filled with random integers. Apply a lambda function to create a new column that is the sum of the existing columns.

### Assignment 10: Working with Text Data

1. Create a Pandas Series with 5 random text strings. Convert all the strings to uppercase.
2. Create a Pandas Series with 5 random text strings. Extract the first three characters of each string.


In [2]:
import pandas as pd
import numpy as np

In [14]:
# 1. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Set the index to be the first column.

df1 = pd.DataFrame(np.random.randint(1,101,size=(6,4)),columns=['ID','A','B','C'])
print(df1)
df1.set_index('ID', inplace=True)
print(df1)

# 2. Create a Pandas DataFrame with columns 'A', 'B', 'C' and index 'X', 'Y', 'Z'. 
# Fill the DataFrame with random integers and access the element at row 'Y' and column 'B'.

df2 = pd.DataFrame(columns=['A', 'B', 'C'],index=['X', 'Y', 'Z'])
df2.loc['Y'] = [1,2,3]
df2.loc['Z'] = [4,5,6]
print(df2)
print()
print(df2.loc['Y', 'B'])  # the task asks for row 'Y', column 'B'

   ID   A   B   C
0  34  48  77   3
1  57  18  56  18
2  85  99  59  46
3  24  89  99   5
4  61  13   5  80
5   7  41  14  27
     A   B   C
ID            
34  48  77   3
57  18  56  18
85  99  59  46
24  89  99   5
61  13   5  80
7   41  14  27
     A    B    C
X  NaN  NaN  NaN
Y    1    2    3
Z    4    5    6

2
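
The task asks for random integers, whereas the cell above fills two rows by hand and leaves row 'X' empty. A minimal sketch of the random variant follows; the seed value and the name `df_rand` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

# 3x3 frame of random integers in [1, 100] with labelled rows and columns.
df_rand = pd.DataFrame(
    rng.integers(1, 101, size=(3, 3)),
    index=['X', 'Y', 'Z'],
    columns=['A', 'B', 'C'],
)
print(df_rand)

# Label-based scalar access: row 'Y', column 'B'.
value = df_rand.loc['Y', 'B']
print(value)

# .at is the scalar-only counterpart of .loc and returns the same element.
print(df_rand.at['Y', 'B'])
```

Passing both labels in one `.loc[row, col]` call avoids the chained lookup `df.loc[row][col]`, which first builds an intermediate row Series.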


In [32]:
# 1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Add a new column that is the product of the first two columns.

df = pd.DataFrame(np.random.randint(1,50,size=(5,3)),columns=['A','B','C'])
print(df)
df['new_proj'] = df['A']*df['B']
print(df)

# 2. Create a Pandas DataFrame with 3 columns and 4 rows filled with random integers. Compute the row-wise and column-wise sum.
# Note: size=(3, 4) produces 3 rows and 4 columns; size=(4, 3) would match the task statement.
df1 = pd.DataFrame(np.random.randint(1, 20, size=(3, 4)))
print(df1)
col_sums = df1.sum(axis=0)         # one total per column
df1['sum_row'] = df1.sum(axis=1)   # one total per row, stored as a new column
print(df1)
print(col_sums.tolist())

    A   B   C
0   9   3  46
1   4   6  42
2   7  20   6
3  14  10  42
4   1  29  25
    A   B   C  new_proj
0   9   3  46        27
1   4   6  42        24
2   7  20   6       140
3  14  10  42       140
4   1  29  25        29
    0  1   2   3
0  11  9  16   8
1  18  6   6  11
2   7  8   2   3
    0  1   2   3  sum_row
0  11  9  16   8       44
1  18  6   6  11       41
2   7  8   2   3       20
[36, 23, 24, 22]


In [43]:
# 1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. 
# Introduce some NaN values. Fill the NaN values with the mean of the respective columns.

df = pd.DataFrame(np.random.randint(1,20,size=(5,3)))
df.loc[[1,3,4], 0] = np.nan
df.loc[[2,4], 1] = np.nan
df.loc[[0,2,4], 2] = np.nan
print(df)
df1=df.fillna(df.mean(axis=0))
print(df1)

# 2. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. 
# Introduce some NaN values. Drop the rows with any NaN values.

print('\n\n')
df2 = pd.DataFrame(np.random.randint(1,20,size=(6,4)))
df2.loc[[1,3,5], 0] = np.nan
print(df2)
df3 = df2.dropna(axis=0)
print('\n\n')
print(df3)

      0     1     2
0  14.0   5.0   NaN
1   NaN   6.0  11.0
2   3.0   NaN   NaN
3   NaN  13.0  12.0
4   NaN   NaN   NaN
      0     1     2
0  14.0   5.0  11.5
1   8.5   6.0  11.0
2   3.0   8.0  11.5
3   8.5  13.0  12.0
4   8.5   8.0  11.5



      0   1   2   3
0  12.0   1   7  18
1   NaN   8  18   3
2  14.0   9   8  15
3   NaN   1   9   6
4  13.0   9   6  12
5   NaN  14   8   9



      0  1  2   3
0  12.0  1  7  18
2  14.0  9  8  15
4  13.0  9  6  12


In [48]:
# 1. Create a Pandas DataFrame with 2 columns: 'Category' and 'Value'. Fill the 'Category' column with random categories ('A', 'B', 'C') and the 'Value' column with random integers. Group the DataFrame by 'Category' and compute the sum and mean of 'Value' for each category.

df = pd.DataFrame({
    'Category': ['A','B','C','A','B','C'],
    'Value': [10,20,30,40,50,60]
})
df_grouped = df.groupby('Category')['Value'].agg(['sum','mean'])
print(df_grouped)

# 2. Create a Pandas DataFrame with 3 columns: 'Product', 'Category', and 'Sales'. Fill the DataFrame with random data. Group the DataFrame by 'Category' and compute the total sales for each category.

products = ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse']
categories = ['Electronics', 'Accessories']

np.random.seed(0)
df2 = pd.DataFrame({
    'Product': np.random.choice(products, 10),
    'Category': np.random.choice(categories, 10),
    'Sales': np.random.randint(100, 1000, 10)
    })
print(df2)
df_grouped1 = df2.groupby('Category')['Sales'].sum()
print(df_grouped1)

          sum  mean
Category           
A          50  25.0
B          70  35.0
C          90  45.0
    Product     Category  Sales
0  Keyboard  Electronics    187
1     Mouse  Accessories    274
2    Laptop  Electronics    700
3   Monitor  Electronics    949
4   Monitor  Electronics    777
5   Monitor  Electronics    637
6     Phone  Electronics    945
7   Monitor  Accessories    172
8     Mouse  Electronics    877
9    Tablet  Accessories    215
Category
Accessories     661
Electronics    5072
Name: Sales, dtype: int32
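
The first task asks for random categories and random values, while the cell above uses hand-written lists. A possible sketch of the random variant is shown here; the seed, the row count of 12, and the names `df_cat` and `summary` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

# Draw 12 random category labels and 12 random integer values.
df_cat = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=12),
    'Value': rng.integers(1, 100, size=12),
})

# Sum and mean of 'Value' per category, one row per category that occurs.
summary = df_cat.groupby('Category')['Value'].agg(['sum', 'mean'])
print(summary)
```

Because the labels are drawn at random, a category that happens not to be drawn simply does not appear in the grouped result.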


In [59]:
# 1. Create two Pandas DataFrames with a common column. Merge the DataFrames using the common column.
np.random.seed(42)
df1= pd.DataFrame(np.random.randint(1,20,size=[3,5]),columns=['A','B','C','D', 'E'])
df2= pd.DataFrame(np.random.randint(1,20,size=[3,5]),columns=['A','B','C','D', 'E'])

print("DataFrame 1:\n", df1)
print("\nDataFrame 2:\n", df2)
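# Inner merge on A, B and C: no row of df1 matches a row of df2 on all three keys,
# so the result is empty (the non-key columns D and E get the suffixes _x and _y).
merge = pd.merge(df1,df2,on=['A','B','C'])
print(merge)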

# 2. Create two Pandas DataFrames with different columns.
# Concatenate the DataFrames along the rows and along the columns.

df3=pd.DataFrame({
    'A': np.random.randint(1, 10, 3),
    'B': np.random.randint(10, 20, 3)
})
df4=pd.DataFrame({
    'C': np.random.randint(1, 10, 3),
    'D': np.random.randint(10, 20, 3)
})
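print("DataFrame 1:\n", df1)
print("\nDataFrame 2:\n", df2)

# Note: df3 and df4 (the frames with different columns) are not used below;
# the calls reuse df1 and df2, which share the columns A to E.
# Concatenate along the rows (axis=0)
concat_rows = pd.concat([df1, df2], axis=0)

# Concatenate along the columns (axis=1)
concat_cols = pd.concat([df1, df2], axis=1)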

print("\nConcatenated along rows:\n", concat_rows)
print("\nConcatenated along columns:\n", concat_cols)

DataFrame 1:
     A   B   C  D  E
0   7  15  11  8  7
1  19  11  11  4  8
2   3   2  12  6  2

DataFrame 2:
     A   B   C   D   E
0   1  12  12  17  10
1  16  15  15  19  12
2   3   5  19   7   9
Empty DataFrame
Columns: [A, B, C, D_x, E_x, D_y, E_y]
Index: []
DataFrame 1:
     A   B   C  D  E
0   7  15  11  8  7
1  19  11  11  4  8
2   3   2  12  6  2

DataFrame 2:
     A   B   C   D   E
0   1  12  12  17  10
1  16  15  15  19  12
2   3   5  19   7   9

Concatenated along rows:
     A   B   C   D   E
0   7  15  11   8   7
1  19  11  11   4   8
2   3   2  12   6   2
0   1  12  12  17  10
1  16  15  15  19  12
2   3   5  19   7   9

Concatenated along columns:
     A   B   C  D  E   A   B   C   D   E
0   7  15  11  8  7   1  12  12  17  10
1  19  11  11  4  8  16  15  15  19  12
2   3   2  12  6  2   3   5  19   7   9
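
The merge in this cell comes back empty because the two random frames never agree on all key columns, and the concatenation runs on frames that share the same columns. The sketch below uses small hand-picked frames so that the merge has matches and the concatenated frames really have different columns. All names (`left`, `right`, `key`, `df_ab`, `df_cd`) are hypothetical.

```python
import pandas as pd

# Two frames sharing a single key column; keys 2 and 3 appear in both.
left = pd.DataFrame({'key': [1, 2, 3], 'x': [10, 20, 30]})
right = pd.DataFrame({'key': [2, 3, 4], 'y': [200, 300, 400]})

# Inner merge (the default) keeps only keys present in both frames.
merged = pd.merge(left, right, on='key')
print(merged)

# Two frames with different columns.
df_ab = pd.DataFrame({'A': [1, 2, 3], 'B': [11, 12, 13]})
df_cd = pd.DataFrame({'C': [4, 5, 6], 'D': [14, 15, 16]})

# Along rows: columns are unioned, and cells with no source value become NaN.
concat_rows = pd.concat([df_ab, df_cd], axis=0)
print(concat_rows)

# Along columns: rows are aligned on the index, giving columns A, B, C, D.
concat_cols = pd.concat([df_ab, df_cd], axis=1)
print(concat_cols)
```

Row-wise concatenation keeps the original index labels, so 0, 1, 2 appear twice; passing `ignore_index=True` would renumber them.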


In [20]:
# 1. Create a Pandas DataFrame with a datetime index and one column filled with random integers. Resample the DataFrame to compute the monthly mean of the values.

dates = pd.date_range(start='2024-01-01',end='2024-12-31',freq='D')

df = pd.DataFrame({
    'Value':np.random.randint(0,100,size=len(dates))
},index=dates)

# 'ME' (month end) replaces the alias 'M', which is deprecated since pandas 2.2
monthly = df.resample('ME').mean()
print(monthly)

# 2. Create a Pandas DataFrame with a datetime index ranging from '2021-01-01' to '2021-12-31' and one column filled with random integers. Compute the rolling mean with a window of 7 days.
dates1 = pd.date_range(start='2021-01-01',end='2021-12-31',freq='D')

df1 = pd.DataFrame({
    'Value':np.random.randint(0,100,size=len(dates1))
},index=dates1)

weekly=df1.rolling(window=7).mean()
print(weekly)

                Value
2024-01-31  41.677419
2024-02-29  50.931034
2024-03-31  48.709677
2024-04-30  51.733333
2024-05-31  42.290323
2024-06-30  40.266667
2024-07-31  57.000000
2024-08-31  47.645161
2024-09-30  46.233333
2024-10-31  52.774194
2024-11-30  53.733333
2024-12-31  46.225806
                Value
2021-01-01        NaN
2021-01-02        NaN
2021-01-03        NaN
2021-01-04        NaN
2021-01-05        NaN
...               ...
2021-12-27  36.571429
2021-12-28  33.428571
2021-12-29  37.000000
2021-12-30  40.142857
2021-12-31  53.571429

[365 rows x 1 columns]
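
The cell above passes the integer `window=7`, which means seven rows and, by default, requires all seven to be present; that is why the first six results are NaN. On a datetime index the window can also be given as the offset `'7D'`, which covers the trailing seven calendar days and defaults to `min_periods=1`. A short sketch comparing the two, with an arbitrary seed and the illustrative names `df_ts`, `by_rows` and `by_days`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
rng = np.random.default_rng(2)  # arbitrary seed
df_ts = pd.DataFrame({'Value': rng.integers(0, 100, size=len(dates))}, index=dates)

# Integer window: 7 rows, and by default all 7 must be present.
by_rows = df_ts.rolling(window=7).mean()

# Offset window: the trailing 7 calendar days; min_periods defaults to 1.
by_days = df_ts.rolling(window='7D').mean()

print(by_rows.head(8))
print(by_days.head(8))
```

With gap-free daily data the two results agree from the seventh row onward; they differ only in how the first six rows are handled.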




In [31]:
# 1. Create a Pandas DataFrame with a MultiIndex (hierarchical index). Perform some basic indexing and slicing operations on the MultiIndex DataFrame.
arrays = [
    ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
    [2020, 2021, 2020, 2021, 2020, 2021, 2020, 2021]
]
index = pd.MultiIndex.from_arrays(arrays, names=('Region', 'Year'))

# Create the DataFrame
df = pd.DataFrame({
    'Sales': np.random.randint(100, 500, size=8),
    'Profit': np.random.randint(20, 100, size=8)
}, index=index)

# Show the full DataFrame
print("Original DataFrame:\n", df)

# Example 1: Indexing - Get data for 'North'
print("\nData for North region:\n", df.loc['North'])

# Example 2: Indexing - Get data for 'South' in 2021
print("\nData for South region in 2021:\n", df.loc[('South', 2021)])

# Example 3: Slicing - All regions for the year 2020
print("\nAll regions in 2020:\n", df.xs(2020, level='Year'))

# Example 4: Cross-section using .xs
print("\nCross-section for Year=2021:\n", df.xs(2021, level='Year'))

# 2. Create a Pandas DataFrame with a MultiIndex consisting of 'Category' and 'SubCategory'. Fill the DataFrame with random data and compute the sum of values for each 'Category' and 'SubCategory'.

categories1 = ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Groceries', 'Groceries']
subcategories1 = ['Phones', 'Laptops', 'Men', 'Women', 'Fruits', 'Vegetables']

index1=pd.MultiIndex.from_arrays([categories1,subcategories1],names=('Category','SubCategory'))
df1=pd.DataFrame({
    'Sales': np.random.randint(1000, 5000, size=6),
    'Profit': np.random.randint( 100, 1000, size=6)
}, index=index1)

gp_sum = df1.groupby(["Category","SubCategory"]).sum()
print(gp_sum)
gp_cat_sum = df1.groupby(level='Category').sum()
print(gp_cat_sum)


Original DataFrame:
              Sales  Profit
Region Year               
North  2020    382      24
       2021    357      35
South  2020    446      22
       2021    469      73
East   2020    205      85
       2021    143      58
West   2020    223      35
       2021    195      59

Data for North region:
       Sales  Profit
Year               
2020    382      24
2021    357      35

Data for South region in 2021:
 Sales     469
Profit     73
Name: (South, 2021), dtype: int32

All regions in 2020:
         Sales  Profit
Region               
North     382      24
South     446      22
East      205      85
West      223      35

Cross-section for Year=2021:
         Sales  Profit
Region               
North     357      35
South     469      73
East      143      58
West      195      59
                         Sales  Profit
Category    SubCategory               
Clothing    Men           3848     504
            Women         3802     388
Electronics Laptops       4192     

In [33]:

### Assignment 8: Pivot Tables

# 1. Create a Pandas DataFrame with columns 'Date', 'Category', and 'Value'. Create a pivot table to compute the sum of 'Value' for each 'Category' by 'Date'.

dates = pd.date_range(start='2021-01-01', periods=10, freq='D')

# Sample categories
categories = ['A', 'B', 'C']

# Create a DataFrame
df = pd.DataFrame({
    'Date': np.random.choice(dates, size=30),
    'Category': np.random.choice(categories, size=30),
    'Value': np.random.randint(10, 100, size=30)
})

print("Original DataFrame:\n", df.head())

# Create pivot table
pivot = pd.pivot_table(df, values='Value', index='Date', columns='Category', aggfunc='sum', fill_value=0)

print("\nPivot Table (Sum of Value by Date and Category):\n", pivot)


# 2. Create a Pandas DataFrame with columns 'Year', 'Quarter', and 'Revenue'. Create a pivot table to compute the mean 'Revenue' for each 'Quarter' by 'Year'.
years = [2020, 2021, 2022, 2023]
quarters = ['Q1', 'Q2', 'Q3', 'Q4']

# Create a DataFrame with random revenue data
data = {
    'Year': np.random.choice(years, size=20),
    'Quarter': np.random.choice(quarters, size=20),
    'Revenue': np.random.randint(10000, 50000, size=20)
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df.head())

# Create pivot table: mean revenue by quarter and year
pivot = pd.pivot_table(df, values='Revenue', index='Year', columns='Quarter', aggfunc='mean', fill_value=0)

print("\nPivot Table (Mean Revenue by Year and Quarter):\n", pivot)


Original DataFrame:
         Date Category  Value
0 2021-01-08        C     22
1 2021-01-03        C     41
2 2021-01-01        B     50
3 2021-01-10        B     48
4 2021-01-08        B     70

Pivot Table (Sum of Value by Date and Category):
 Category      A    B    C
Date                     
2021-01-01    0   50    0
2021-01-02   33    0   70
2021-01-03   13    0  112
2021-01-04    0   11  149
2021-01-05    0   64    0
2021-01-06    0   35   59
2021-01-07   32    0   16
2021-01-08    0   70   22
2021-01-09  131   19   36
2021-01-10  159  206    0
Original DataFrame:
    Year Quarter  Revenue
0  2020      Q1    31185
1  2023      Q1    24134
2  2022      Q2    43654
3  2021      Q2    28083
4  2021      Q1    13573

Pivot Table (Mean Revenue by Year and Quarter):
 Quarter       Q1            Q2       Q3       Q4
Year                                            
2020     27090.5  13060.000000  30660.5      0.0
2021     12824.0  33965.333333  41968.0  40582.0
2022         0.0  33354.0

In [35]:

### Assignment 9: Applying Functions

# 1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Apply a function that doubles the values of the DataFrame.

df = pd.DataFrame(np.random.randint(0, 50, size=(5, 3)), columns=['A', 'B', 'C'])

print("Original DataFrame:\n", df)

# Apply a function to double each value
# (DataFrame.map replaces applymap, which is deprecated since pandas 2.1)
df_doubled = df.map(lambda x: x * 2)

print("\nDoubled DataFrame:\n", df_doubled)

# 2. Create a Pandas DataFrame with 3 columns and 6 rows filled with random integers. Apply a lambda function to create a new column that is the sum of the existing columns.

df = pd.DataFrame(np.random.randint(0, 50, size=(6, 3)),columns=['A','B','C'])
print("Original DataFrame:\n", df)
# Apply a lambda function to create a new column
df['sum'] = df.apply(lambda row: row['A'] + row['B'] +
                     row['C'], axis=1)
print("\nDataFrame with new column:\n", df)


Original DataFrame:
     A   B   C
0  20  30  33
1   4   3  32
2  10  38  15
3   3  40  17
4  39   5  39

Doubled DataFrame:
     A   B   C
0  40  60  66
1   8   6  64
2  20  76  30
3   6  80  34
4  78  10  78
Original DataFrame:
     A   B   C
0  17  22  32
1   2  43  24
2  22  39   0
3  16  26  45
4  37  10  30
5  12  28  11

DataFrame with new column:
     A   B   C  sum
0  17  22  32   71
1   2  43  24   69
2  22  39   0   61
3  16  26  45   87
4  37  10  30   77
5  12  28  11   51
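
Both tasks ask for a function to be applied, which the cell above does. For comparison, the same results can be obtained with vectorized operations that avoid calling a Python function per cell or per row. This is a sketch with an arbitrary seed and illustrative names (`df_num`, `doubled`, `row_total`), not a replacement for the exercise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed
df_num = pd.DataFrame(rng.integers(0, 50, size=(6, 3)), columns=['A', 'B', 'C'])

# Element-wise doubling via arithmetic on the whole frame.
doubled = df_num * 2

# Row-wise sum of the three columns without a lambda.
row_total = df_num[['A', 'B', 'C']].sum(axis=1)

# The apply-based equivalent, kept here only to compare results.
row_total_apply = df_num.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)

print(doubled)
print(row_total.tolist())
print(row_total_apply.tolist())
```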




In [36]:
### Assignment 10: Working with Text Data

# 1. Create a Pandas Series with 5 random text strings. Convert all the strings to uppercase.

words = ['apple', 'banana', 'cherry', 'date', 'fig', 'grape', 'melon']

# Create a Series with 5 random text strings
series = pd.Series(np.random.choice(words, size=5))

print("Original Series:\n", series)

# Convert all strings to uppercase
uppercase_series = series.str.upper()

print("\nUppercase Series:\n", uppercase_series)

# 2. Create a Pandas Series with 5 random text strings. Extract the first three characters of each string.
series = pd.Series(np.random.choice(words, size=5))
print("Original Series:\n", series)
# Extract the first three characters of each string
first_three_chars = series.str[:3]
print("\nFirst Three Characters Series:\n", first_three_chars)


Original Series:
 0     apple
1    banana
2       fig
3    cherry
4    banana
dtype: object

Uppercase Series:
 0     APPLE
1    BANANA
2       FIG
3    CHERRY
4    BANANA
dtype: object
Original Series:
 0    apple
1    grape
2    apple
3    apple
4    grape
dtype: object

First Three Characters Series:
 0    app
1    gra
2    app
3    app
4    gra
dtype: object
