## Pandas Advanced Assignment 1

Q1. List any five functions of the pandas library with execution.

1. read_csv(): This function is used to read a CSV (Comma Separated Values) file and convert it into a pandas DataFrame.
Example: Suppose we have a CSV file named data.csv in the same directory as our Python code. We can read it using the following code:

In [None]:
import pandas as pd
df = pd.read_csv('data.csv')


2. head(): This method returns the first n rows of a DataFrame. By default, it returns the first 5 rows.
Example: We can use the head() function on our df DataFrame to view the first 5 rows:

In [None]:
print(df.head())


3. describe(): This method generates summary statistics for a DataFrame: count, mean, standard deviation, minimum, maximum, and quartiles. By default, numeric columns are summarized when any are present.
Example: We can use the describe() function on our df DataFrame to generate summary statistics:

In [None]:
print(df.describe())


4. groupby(): This method groups the rows of a DataFrame by one or more columns so that a function can be applied to each group.
Example: Suppose we have a DataFrame named sales_df with columns Region, Product, and Sales. We can group the rows by Region and calculate the total sales for each region using the following code:

In [None]:
# Select 'Sales' before summing so the text column 'Product' is not aggregated
grouped_df = sales_df.groupby('Region')[['Sales']].sum()


5. plot(): This method creates plots such as line plots, scatter plots, bar plots, and histograms. It uses matplotlib as its default backend, so matplotlib must be installed.
Example: We can call plot() on grouped_df to create a bar plot of the total sales for each region:

In [None]:
grouped_df.plot(kind='bar', y='Sales')
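
The examples above depend on a data.csv file and a sales_df DataFrame that are not shown. The sketch below is one way to try the same calls without any files: it reads CSV text from an in-memory buffer instead of disk. The column names come from the groupby() example, but the rows are invented for illustration, and the plot step is left out because it needs matplotlib and a display.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk (values are invented)
csv_text = """Region,Product,Sales
North,Widget,100
South,Widget,150
North,Gadget,200
South,Gadget,50
East,Widget,75
"""

# read_csv() accepts a file path or any file-like object
sales_df = pd.read_csv(io.StringIO(csv_text))

# First three rows instead of the default five
print(sales_df.head(3))

# Summary statistics for the numeric column 'Sales'
print(sales_df.describe())

# Total sales per region; selecting 'Sales' keeps the text column out of the sum
region_totals = sales_df.groupby('Region')[['Sales']].sum()
print(region_totals)
```

Calling region_totals.plot(kind='bar', y='Sales') on the result would then draw one bar per region.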


Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [2]:
import pandas as pd

def reindex_df(df):
    new_index = pd.RangeIndex(start=1, stop=2*len(df), step=2)
    df = df.set_index(new_index)
    return df


Explanation:

pd.RangeIndex() builds an index that starts at 1 and increases by 2 for each row. The start parameter is the first label, stop is the exclusive upper bound, and step is the increment; stop=2*len(df) yields exactly len(df) labels (1, 3, 5, ...).
set_index() returns a new DataFrame that uses new_index as its row labels; the original DataFrame is not modified.
The function returns the re-indexed DataFrame.
You can use this function to re-index a DataFrame as follows:

In [3]:
# Example DataFrame
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})

# Re-index the DataFrame
df = reindex_df(df)

# Print the re-indexed DataFrame
print(df)


    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
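
As an alternative to set_index(), the same labels can be assigned directly to the index attribute of a copy. This is only a sketch of that variant; the function name reindex_by_assignment is made up for illustration.

```python
import pandas as pd

def reindex_by_assignment(df):
    # Work on a copy so the caller's DataFrame keeps its original index
    out = df.copy()
    # RangeIndex(1, 2*n, 2) yields exactly n odd labels: 1, 3, 5, ...
    out.index = pd.RangeIndex(start=1, stop=2 * len(out), step=2)
    return out

example = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
result = reindex_by_assignment(example)

print(result.index.tolist())   # labels of the re-indexed copy
print(example.index.tolist())  # the input still has its default labels
```

Both approaches give the same result here; direct assignment mainly makes it obvious that only the labels change, not the data.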


Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.
For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should
calculate and print the sum of the first three values, which is 60.

In [4]:
import pandas as pd

def sum_first_three(df):
    values_sum = sum(df['Values'].iloc[:3])
    print("Sum of the first three values:", values_sum)


Explanation:

Python's built-in sum() adds up the first three values of the 'Values' column, which iloc[:3] selects by position. If the column has fewer than three rows, the slice returns whatever rows exist.
The function prints the sum to the console.
You can use this function to calculate the sum of the first three values in a DataFrame as follows:

In [5]:
# Example DataFrame
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Call the function
sum_first_three(df)


Sum of the first three values: 60
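
The question asks for a function that iterates over the DataFrame, while the solution above slices instead. A loop-based variant might look like the sketch below; the name sum_first_three_loop is hypothetical, and returning the total in addition to printing it is an extra convenience, not part of the question.

```python
import pandas as pd

def sum_first_three_loop(df):
    total = 0
    # Visit the values one at a time and stop once three have been added
    for position, value in enumerate(df['Values']):
        if position == 3:
            break
        total += value
    print("Sum of the first three values:", total)
    return total

values_df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
loop_total = sum_first_three_loop(values_df)
```

For real workloads the slicing version is usually preferable, since explicit Python loops over a DataFrame are slower than built-in pandas operations.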


Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.

In [6]:
import pandas as pd

def add_word_count(df):
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    return df


Explanation:

The apply() method calls the lambda function on each value in the 'Text' column.
The lambda converts the value to a string with str(), splits it on whitespace with split(), and counts the resulting words with len().
The counts are stored in a new 'Word_Count' column through the assignment df['Word_Count'] = ...
The function returns the modified DataFrame.
You can use this function to add a new column 'Word_Count' to a DataFrame 'df' as follows:

In [7]:
# Example DataFrame
df = pd.DataFrame({'Text': ['This is a sentence.', 'This is another sentence.', 'A third sentence.']})

# Call the function
df = add_word_count(df)

# Print the modified DataFrame
print(df)


                        Text  Word_Count
0        This is a sentence.           4
1  This is another sentence.           4
2          A third sentence.           3
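
One detail of the apply() version is that str(x) turns a missing value into the text 'None' or 'nan', which then counts as one word. The sketch below shows a vectorized variant built on the .str accessor that counts missing text as zero words instead. The function name add_word_count_vectorized and the choice to treat missing text as 0 are assumptions made for illustration.

```python
import pandas as pd

def add_word_count_vectorized(df):
    # Copy so the caller's DataFrame is left unchanged
    out = df.copy()
    # .str.split() splits on whitespace; .str.len() gives the length of each list
    counts = out['Text'].str.split().str.len()
    # Missing text produces NaN; treat it as zero words
    out['Word_Count'] = counts.fillna(0).astype(int)
    return out

sample = pd.DataFrame({'Text': ['This is a sentence.', None, 'A third sentence.']})
counted = add_word_count_vectorized(sample)
print(counted)
```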


Q5. How are DataFrame.size() and DataFrame.shape() different?

DataFrame.size and DataFrame.shape both describe the dimensions of a DataFrame, but they return different values. Despite the parentheses in the question, both are attributes rather than methods, so they are accessed without calling them.

DataFrame.size returns the total number of elements in the DataFrame, which equals the number of rows multiplied by the number of columns. It is a single integer.

DataFrame.shape returns a tuple of integers giving the dimensions of the DataFrame: the first element is the number of rows and the second is the number of columns.

In [8]:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Get the size of the DataFrame
df_size = df.size

# Get the shape of the DataFrame
df_shape = df.shape

# Print the results
print("DataFrame size:", df_size)
print("DataFrame shape:", df_shape)


DataFrame size: 9
DataFrame shape: (3, 3)
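
Because the example above is square, it does not show which element of shape is the row count. The following sketch uses a hypothetical 2-row, 3-column DataFrame to make the relationship between the two attributes explicit.

```python
import pandas as pd

# Two rows and three columns, so rows and columns are easy to tell apart
wide = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

rows, cols = wide.shape
print("shape:", wide.shape)
print("size:", wide.size)
print("size equals rows * cols:", wide.size == rows * cols)

# A single column is a Series, whose shape is a one-element tuple
print("Series shape:", wide['A'].shape)
```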


Q6. Which function of pandas do we use to read an excel file?

To read an Excel file using Pandas, we can use the read_excel() function.

In [None]:
import pandas as pd

# read the excel file into a pandas dataframe
df = pd.read_excel('example.xlsx', sheet_name='Sheet1')

# display the contents of the dataframe
print(df)


Explanation:

The pd.read_excel() function is used to read an Excel file and convert it to a Pandas DataFrame.
In the example, the function reads the Excel file example.xlsx and extracts data from the sheet named Sheet1.
The resulting DataFrame is stored in the variable df.
Finally, the contents of the DataFrame are printed to the console using the print() function.
You can modify the arguments of pd.read_excel() according to the specific Excel file you want to read. For example, you can specify the sheet name, column names, or index column using various optional parameters of the read_excel() function.

Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.
The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.

Here is a Python function that extracts the username from the 'Email' column of a Pandas DataFrame and stores it in a new 'Username' column:

In [10]:
import pandas as pd

def extract_username(df):
    # Extract the username from the 'Email' column
    df['Username'] = df['Email'].apply(lambda x: x.split('@')[0])
    return df


Explanation:

The function takes a Pandas DataFrame df as input.
The lambda splits each email address on the '@' symbol with split() and keeps the first element of the resulting list. The apply() method runs this lambda on every value in the 'Email' column.
The extracted usernames are then stored in a new 'Username' column using the assignment operator =.
Finally, the function returns the modified DataFrame.
You can call this function with your DataFrame as input to create the new 'Username' column. For example:

In [11]:
# Create a sample DataFrame
df = pd.DataFrame({'Email': ['susmith@example.com', 'luffy@example.com']})

# Call the function to extract the usernames
df = extract_username(df)

# Print the resulting DataFrame
print(df)


                 Email Username
0  susmith@example.com  susmith
1    luffy@example.com    luffy
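
The apply() version raises an AttributeError if the 'Email' column contains a missing value, because None and NaN have no split() method. A possible alternative is the vectorized .str accessor, which passes missing values through as missing. This is a sketch only; the function name extract_username_vectorized is hypothetical.

```python
import pandas as pd

def extract_username_vectorized(df):
    # Copy so the caller's DataFrame is left unchanged
    out = df.copy()
    # Split on the first '@' only, then take the piece before it
    out['Username'] = out['Email'].str.split('@', n=1).str[0]
    return out

emails = pd.DataFrame({'Email': ['john.doe@example.com', None]})
with_usernames = extract_username_vectorized(emails)
print(with_usernames)
```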


Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.
For example, suppose df contains the following values:

   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

Here is a Python function that selects all rows from a Pandas DataFrame df where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10:

In [12]:
import pandas as pd

def select_rows(df):
    # Select rows where 'A' > 5 and 'B' < 10
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows


Explanation:

The function takes a Pandas DataFrame df as input.
The function selects rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10 using the & operator to combine two conditions.
The selected rows are stored in a new DataFrame called selected_rows.
Finally, the function returns the new DataFrame.
You can call this function with your DataFrame as input to select the desired rows. For example:

In [13]:
# Create a sample DataFrame
df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})

# Call the function to select rows where 'A' > 5 and 'B' < 10
selected_rows = select_rows(df)

# Print the resulting DataFrame
print(selected_rows)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
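
In the boolean-mask version, the parentheses around each comparison are required because & binds more tightly than > and <. An alternative that avoids this is DataFrame.query(), which takes the condition as a string. The sketch below assumes the same columns as the example; the name select_rows_query is made up for illustration.

```python
import pandas as pd

def select_rows_query(df):
    # The condition is written as a string; column names are used directly
    return df.query('A > 5 and B < 10')

data = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                     'B': [5, 2, 9, 3, 1],
                     'C': [1, 7, 4, 5, 2]})

by_query = select_rows_query(data)
by_mask = data[(data['A'] > 5) & (data['B'] < 10)]

print(by_query)
print("Same rows as the mask version:", by_query.equals(by_mask))
```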
