Q1. List any five functions of the pandas library with execution.

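1. read_csv(): Reads a CSV file (or file-like object) into a DataFrame.
2. head(): Returns the first n rows of a DataFrame (n defaults to 5).
3. describe(): Generates descriptive statistics (count, mean, std, min, quartiles, max) for the numeric columns.
4. groupby(): Groups the rows of a DataFrame by one or more keys so an aggregate such as mean() can be computed per group.
5. merge(): Merges DataFrame or named Series objects with a database-style join.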
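
read_csv() is the one function in this list that needs a file to run. A minimal sketch, assuming a small made-up CSV held in memory (the column names and values are hypothetical, not part of the assignment data):

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "name,score\nAsha,82\nRavi,91\nMeena,77\n"

# read_csv() accepts a file path or any file-like object such as StringIO.
scores = pd.read_csv(io.StringIO(csv_text))

print(scores)
print(scores.dtypes)
```

With a real file, the call would take the path instead, for example pd.read_csv('data.csv').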

In [2]:
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target

# 1. head()
print("First 5 rows of the DataFrame:")
print(df.head())

# 2. describe()
print("\nDescriptive statistics:")
print(df.describe())

# 3. groupby()
grouped_df = df.groupby('species').mean()
print("\nGrouped DataFrame by 'species':")
print(grouped_df)

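# 4. merge()
# 'species' is not unique in either frame, so this is a many-to-many join:
# only species 1 appears in both halves, and each of its rows in df1 is
# paired with each of its rows in df2.
df1 = df[['sepal length (cm)', 'species']].iloc[:75]
df2 = df[['sepal width (cm)', 'species']].iloc[75:]
merged_df = pd.merge(df1, df2, on='species', how='inner')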
print("\nMerged DataFrame:")
print(merged_df.head())


First 5 rows of the DataFrame:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   species  
0        0  
1        0  
2        0  
3        0  
4        0  

Descriptive statistics:
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
5

Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [3]:
def reindex_dataframe(df):
    new_index = range(1, 2 * len(df) + 1, 2)
    df_reindexed = df.set_index(pd.Index(new_index))
    return df_reindexed

# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print("Original DataFrame:")
print(df)

# Reindexing DataFrame
df_reindexed = reindex_dataframe(df)
print("\nReindexed DataFrame:")
print(df_reindexed)


Original DataFrame:
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

Reindexed DataFrame:
   A  B  C
1  1  4  7
3  2  5  8
5  3  6  9
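
The same index can also be built with pd.RangeIndex, which states the start and step directly. This is only a sketch of an alternative; the function name reindex_with_range is made up for illustration, and it works on a copy so the caller's DataFrame keeps its original index.

```python
import pandas as pd


def reindex_with_range(df):
    # Work on a copy so the input DataFrame is left unchanged.
    out = df.copy()
    # Start at 1, step by 2, and stop after exactly len(df) labels.
    out.index = pd.RangeIndex(start=1, stop=2 * len(out) + 1, step=2)
    return out


sample = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
result = reindex_with_range(sample)
print(result)
```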


Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.

In [4]:
def sum_first_three_values(df):
    sum_values = df['Values'].iloc[:3].sum()
    print(f"Sum of the first three values: {sum_values}")

# Example DataFrame
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
print("Original DataFrame:")
print(df)

# Calculating sum of the first three values
sum_first_three_values(df)


Original DataFrame:
   Values
0      10
1      20
2      30
3      40
4      50
Sum of the first three values: 60
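
The question asks for iteration, while the function above relies on slicing. A sketch of an explicit loop, assuming itertuples() together with itertools.islice is an acceptable way to iterate (the function name is hypothetical):

```python
from itertools import islice

import pandas as pd


def sum_first_three_by_iteration(df):
    total = 0
    # itertuples() yields one named tuple per row; islice stops after three rows.
    for row in islice(df.itertuples(index=False), 3):
        total += row.Values
    print(f"Sum of the first three values: {total}")
    return total


sample = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
total = sum_first_three_by_iteration(sample)
```

The slicing version is shorter and faster on large frames; the loop is shown only because the wording of the question mentions iterating.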


Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.

In [5]:
def add_word_count_column(df):
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    return df

# Example DataFrame
df = pd.DataFrame({'Text': ["Hello world", "Pandas is great", "This is a test"]})
print("Original DataFrame:")
print(df)

# Adding Word_Count column
df_with_word_count = add_word_count_column(df)
print("\nDataFrame with Word_Count column:")
print(df_with_word_count)


Original DataFrame:
              Text
0      Hello world
1  Pandas is great
2   This is a test

DataFrame with Word_Count column:
              Text  Word_Count
0      Hello world           2
1  Pandas is great           3
2   This is a test           4
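
One detail of the apply() version: str(x) turns a missing value into the text 'None' or 'nan', so a missing row is counted as one word. A sketch of a vectorized alternative, assuming missing text should count as zero words (that choice, and the function name, are assumptions):

```python
import pandas as pd


def add_word_count_vectorized(df):
    # Work on a copy so the input DataFrame is left unchanged.
    out = df.copy()
    # str.split() with no argument splits on runs of whitespace;
    # str.len() then counts the pieces in each row.
    counts = out['Text'].str.split().str.len()
    # Missing text yields NaN here; treat it as zero words (assumption).
    out['Word_Count'] = counts.fillna(0).astype(int)
    return out


sample = pd.DataFrame({'Text': ["Hello world", None, "This is a test"]})
counted = add_word_count_vectorized(sample)
print(counted)
```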


Q5. How are DataFrame.size() and DataFrame.shape() different?

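Both size and shape are attributes, not methods, so they are accessed without parentheses.

- DataFrame.size: Returns the number of elements in the DataFrame (total number of cells, i.e. rows * columns).

- DataFrame.shape: Returns a tuple giving the dimensions of the DataFrame as (number of rows, number of columns).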

In [6]:
# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print("Original DataFrame:")
print(df)

# Using size and shape
print("\nDataFrame size:")
print(df.size)  # Output: 6 (3 rows * 2 columns)

print("\nDataFrame shape:")
print(df.shape)  # Output: (3, 2) (3 rows, 2 columns)


Original DataFrame:
   A  B
0  1  4
1  2  5
2  3  6

DataFrame size:
6

DataFrame shape:
(3, 2)
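
A short sketch of how the two attributes relate, and of what happens when they are called with parentheses as written in the question (the variable names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# size is always the product of the entries in shape.
rows, cols = frame.shape
print(frame.size == rows * cols)

# A single column is a Series: its shape is a one-element tuple.
print(frame['A'].shape, frame['A'].size)

# shape is a plain tuple, so calling it like a method fails.
try:
    frame.shape()
except TypeError as exc:
    print(f"TypeError: {exc}")
```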


Q6. Which function of pandas do we use to read an excel file?

The function used to read an Excel file in pandas is read_excel().

In [15]:
# Reading an Excel file (commented out: it needs a 'data.xlsx' file to exist)
# df = pd.read_excel('data.xlsx')
# print("DataFrame from Excel file:")
# print(df.head())

Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.
The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.

In [8]:
def extract_usernames(df):
    df['Username'] = df['Email'].apply(lambda x: x.split('@')[0])
    return df

# Example DataFrame
df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.doe@example.com']})
print("Original DataFrame:")
print(df)

# Extracting usernames
df_with_usernames = extract_usernames(df)
print("\nDataFrame with Username column:")
print(df_with_usernames)


Original DataFrame:
                  Email
0  john.doe@example.com
1  jane.doe@example.com

DataFrame with Username column:
                  Email  Username
0  john.doe@example.com  john.doe
1  jane.doe@example.com  jane.doe
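
The lambda above raises an AttributeError if a row holds a missing value, because None and NaN have no split() method. A sketch using the .str accessor, which leaves missing rows as missing instead (the function name is hypothetical, and keeping missing values as NaN is an assumption about the desired behavior):

```python
import pandas as pd


def extract_usernames_vectorized(df):
    # Work on a copy so the input DataFrame is left unchanged.
    out = df.copy()
    # Split once at the first '@' and keep the part before it.
    out['Username'] = out['Email'].str.split('@', n=1).str[0]
    return out


sample = pd.DataFrame(
    {'Email': ['john.doe@example.com', None, 'jane.doe@example.com']}
)
with_users = extract_usernames_vectorized(sample)
print(with_users)
```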


Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.

In [9]:
def select_rows(df):
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows

# Example DataFrame
df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})
print("Original DataFrame:")
print(df)

# Selecting rows
selected_df = select_rows(df)
print("\nSelected rows:")
print(selected_df)


Original DataFrame:
   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

Selected rows:
   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
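
The same filter can be written with DataFrame.query(), which some readers find easier to scan. A sketch with the thresholds exposed as parameters; the parameter names a_min and b_max are made up for illustration:

```python
import pandas as pd


def select_rows_query(df, a_min=5, b_max=10):
    # '@name' inside the query string refers to a local Python variable.
    return df.query('A > @a_min and B < @b_max')


sample = pd.DataFrame(
    {'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]}
)
selected = select_rows_query(sample)
print(selected)
```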


Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean,
median, and standard deviation of the values in the 'Values' column.

In [10]:
def calculate_statistics(df):
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_dev = df['Values'].std()
    return mean_value, median_value, std_dev

# Example DataFrame
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
print("Original DataFrame:")
print(df)

# Calculating statistics
mean_value, median_value, std_dev = calculate_statistics(df)
print(f"\nMean: {mean_value}, Median: {median_value}, Standard Deviation: {std_dev}")


Original DataFrame:
   Values
0      10
1      20
2      30
3      40
4      50

Mean: 30.0, Median: 30.0, Standard Deviation: 15.811388300841896
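
The value 15.81 is the sample standard deviation: Series.std() divides by n - 1 by default (ddof=1). A brief sketch of the difference from the population standard deviation, which divides by n:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# Default: sample standard deviation (ddof=1, divides by n - 1).
sample_std = values.std()

# Population standard deviation (ddof=0, divides by n).
population_std = values.std(ddof=0)

print(f"Sample std: {sample_std}")
print(f"Population std: {population_std}")
```

Which one is appropriate depends on whether the column is treated as a sample or as the whole population; the question does not say.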


Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and
should include the current day.

In [11]:
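def add_moving_average(df):
    # window=7 counts rows, so this assumes one row per day in date order.
    # min_periods=1 averages whatever is available for the first six rows.
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    return df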

# Example DataFrame
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Sales': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})
print("Original DataFrame:")
print(df)

# Adding moving average
df_with_moving_average = add_moving_average(df)
print("\nDataFrame with MovingAverage column:")
print(df_with_moving_average)


Original DataFrame:
        Date  Sales
0 2023-01-01     10
1 2023-01-02     20
2 2023-01-03     30
3 2023-01-04     40
4 2023-01-05     50
5 2023-01-06     60
6 2023-01-07     70
7 2023-01-08     80
8 2023-01-09     90
9 2023-01-10    100

DataFrame with MovingAverage column:
        Date  Sales  MovingAverage
0 2023-01-01     10           10.0
1 2023-01-02     20           15.0
2 2023-01-03     30           20.0
3 2023-01-04     40           25.0
4 2023-01-05     50           30.0
5 2023-01-06     60           35.0
6 2023-01-07     70           40.0
7 2023-01-08     80           50.0
8 2023-01-09     90           60.0
9 2023-01-10    100           70.0
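
rolling(window=7) counts rows, not calendar days, so the two agree only when there is exactly one row per day. A sketch of a time-based window, assuming the 'Date' column has a datetime dtype; the function name and the sample data with a gap are made up for illustration:

```python
import pandas as pd


def add_moving_average_by_date(df):
    # A time-based window needs the dates in increasing order.
    out = df.sort_values('Date').copy()
    sales_by_date = out.set_index('Date')['Sales']
    # '7D' covers the current day and the six calendar days before it.
    out['MovingAverage'] = sales_by_date.rolling('7D').mean().to_numpy()
    return out


# Hypothetical data with a week-long gap before the last row.
sample = pd.DataFrame({
    'Date': pd.to_datetime(
        ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-10']
    ),
    'Sales': [10, 20, 30, 100],
})
result = add_moving_average_by_date(sample)
print(result)
```

In this sample the last row averages only its own value, because no other sale falls within the seven days ending on 2023-01-10; a row-based window of 7 would have averaged all four rows.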


Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.

In [12]:
def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.day_name()
    return df

# Example DataFrame
df = pd.DataFrame({'Date': pd.date_range(start='2023-01-01', periods=5, freq='D')})
print("Original DataFrame:")
print(df)

# Adding weekday column
df_with_weekday = add_weekday_column(df)
print("\nDataFrame with Weekday column:")
print(df_with_weekday)


Original DataFrame:
        Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05

DataFrame with Weekday column:
        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
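
The .dt accessor only works on datetime columns. If 'Date' arrives as text, for example after reading a CSV, it has to be converted first. A sketch under that assumption (the function name is hypothetical):

```python
import pandas as pd


def add_weekday_from_text(df):
    # Work on a copy so the input DataFrame is left unchanged.
    out = df.copy()
    # Convert text such as '2023-01-01' to datetime before using .dt.
    dates = pd.to_datetime(out['Date'])
    out['Weekday'] = dates.dt.day_name()
    return out


sample = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-07']})
with_weekday = add_weekday_from_text(sample)
print(with_weekday)
```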


Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [13]:
def select_rows_between_dates(df):
    start_date = '2023-01-01'
    end_date = '2023-01-31'
    mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
    return df.loc[mask]

# Example DataFrame
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=60, freq='D'),
    'Values': range(60)
})
print("Original DataFrame:")
print(df)

# Selecting rows
selected_df = select_rows_between_dates(df)
print("\nSelected rows:")
print(selected_df)


Original DataFrame:
         Date  Values
0  2023-01-01       0
1  2023-01-02       1
2  2023-01-03       2
3  2023-01-04       3
4  2023-01-05       4
5  2023-01-06       5
6  2023-01-07       6
7  2023-01-08       7
8  2023-01-09       8
9  2023-01-10       9
10 2023-01-11      10
11 2023-01-12      11
12 2023-01-13      12
13 2023-01-14      13
14 2023-01-15      14
15 2023-01-16      15
16 2023-01-17      16
17 2023-01-18      17
18 2023-01-19      18
19 2023-01-20      19
20 2023-01-21      20
21 2023-01-22      21
22 2023-01-23      22
23 2023-01-24      23
24 2023-01-25      24
25 2023-01-26      25
26 2023-01-27      26
27 2023-01-28      27
28 2023-01-29      28
29 2023-01-30      29
30 2023-01-31      30
31 2023-02-01      31
32 2023-02-02      32
33 2023-02-03      33
34 2023-02-04      34
35 2023-02-05      35
36 2023-02-06      36
37 2023-02-07      37
38 2023-02-08      38
39 2023-02-09      39
40 2023-02-10      40
41 2023-02-11      41
42 2023-02-12      42
43 2023-02-1
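
Because '2023-01-31' is interpreted as midnight at the start of that day, the comparison Date <= end_date drops any timestamp later on 31 January. A sketch that keeps the whole last day by using an exclusive upper bound one day later; the function name, parameters, and sample timestamps are assumptions for illustration:

```python
import pandas as pd


def select_rows_through_end_of_day(df, start='2023-01-01', end='2023-01-31'):
    start_ts = pd.Timestamp(start)
    # Exclusive upper bound: midnight at the start of the day after `end`.
    end_exclusive = pd.Timestamp(end) + pd.Timedelta(days=1)
    mask = (df['Date'] >= start_ts) & (df['Date'] < end_exclusive)
    return df.loc[mask]


# Hypothetical timestamps around both boundaries.
sample = pd.DataFrame({
    'Date': pd.to_datetime([
        '2022-12-31 23:59',
        '2023-01-01 00:00',
        '2023-01-31 15:30',
        '2023-02-01 00:00',
    ]),
    'Values': [0, 1, 2, 3],
})
selected = select_rows_through_end_of_day(sample)
print(selected)
```

For data with one midnight timestamp per day, as in the example above, both versions select the same rows.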

Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to
be imported?

The pandas library itself must be imported before any of its functions can be used. By convention it is imported under the alias pd.

In [14]:
import pandas as pd
