# **Pandas Assignment 1**

### **Q1. List any five functions of the pandas library with execution.**

Here are five commonly used functions in the pandas library, each of which is executed in the code cell that follows:

1. **`read_csv`**: This function is used to read a comma-separated values (CSV) file into a DataFrame.

2. **`head`**: This function returns the first n rows of a DataFrame.

3. **`describe`**: This function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

4. **`groupby`**: This function is used to split the data into groups based on some criteria and apply a function to each group independently.

5. **`merge`**: This function is used to merge DataFrame objects by performing a database-style join operation.

Let's execute these functions step by step.

In [1]:
import pandas as pd

# Create sample data and save to a CSV file
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [24, 27, 22, 32, 29],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}
df = pd.DataFrame(data)
df.to_csv('sample.csv', index=False)

# 1. read_csv
df_read = pd.read_csv('sample.csv')
print("DataFrame read from CSV:")
print(df_read)

# 2. head
df_head = df_read.head()
print("\nFirst few rows of the DataFrame:")
print(df_head)

# 3. describe
df_describe = df_read.describe()
print("\nDescriptive statistics of the DataFrame:")
print(df_describe)

# 4. groupby
grouped = df_read.groupby('City')['Age'].mean()
print("\nMean age grouped by 'City':")
print(grouped)

# 5. merge
data2 = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
         'Salary': [70000, 80000, 75000, 90000, 85000]}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df_read, df2, on='Name')
print("\nMerged DataFrame:")
print(merged_df)


DataFrame read from CSV:
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4      Eve   29      Phoenix

First few rows of the DataFrame:
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4      Eve   29      Phoenix

Descriptive statistics of the DataFrame:
             Age
count   5.000000
mean   26.800000
std     3.962323
min    22.000000
25%    24.000000
50%    27.000000
75%    29.000000
max    32.000000

Mean age grouped by 'City':
City
Chicago        22.0
Houston        32.0
Los Angeles    27.0
New York       24.0
Phoenix        29.0
Name: Age, dtype: float64

Merged DataFrame:
      Name  Age         City  Salary
0    Alice   24     New York   70000
1      Bob   27  Los Angeles   80000
2  Charlie   22      Chicago   75000
3    David   32      Houston   90000
4      Eve   29      Phoenix   85000
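
`merge` performs an inner join by default, so rows whose key appears in only one DataFrame are dropped. The sketch below contrasts that default with `how='outer'`. It uses two small made-up DataFrames; the names `left` and `right` and their contents are illustrative assumptions, not part of the assignment data.

```python
import pandas as pd

# Hypothetical data: 'Bob' is the only name present in both frames
left = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})
right = pd.DataFrame({'Name': ['Bob', 'Carol'], 'Salary': [80000, 65000]})

# Default (how='inner'): keep only the keys found in both frames
inner = pd.merge(left, right, on='Name')

# how='outer': keep every key; values missing on one side become NaN
outer = pd.merge(left, right, on='Name', how='outer')

print(inner)
print(outer)
```

In the Q1 example both DataFrames contain the same five names, so the default inner join keeps every row.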

### **Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.**

In [2]:
import pandas as pd

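def reindex_dataframe(df):
    # Build an index that starts at 1 and increases by 2: 1, 3, 5, ...
    new_index = range(1, 2 * len(df) + 1, 2)

    # Assign the new index in place (this also changes the caller's DataFrame)
    df.index = new_index

    return df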


data = {
        'A': [10, 20, 30],
        'B': [40, 50, 60],
        'C': [70, 80, 90]
        }
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)


df_reindexed = reindex_dataframe(df)

print("\nRe-indexed DataFrame:")
print(df_reindexed)


Original DataFrame:
    A   B   C
0  10  40  70
1  20  50  80
2  30  60  90

Re-indexed DataFrame:
    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
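
The function above assigns to `df.index`, which changes the DataFrame that was passed in. If the original must stay untouched, one possible variant returns a re-indexed copy instead. This is a sketch only; `reindex_copy` is a hypothetical name chosen for illustration.

```python
import pandas as pd

def reindex_copy(df):
    # Hypothetical helper: return a re-indexed copy and leave `df` unchanged
    new_index = pd.RangeIndex(start=1, stop=2 * len(df) + 1, step=2)
    return df.set_axis(new_index, axis=0)

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
result = reindex_copy(df)

print(result.index.tolist())  # index of the new DataFrame
print(df.index.tolist())      # index of the original DataFrame
```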


### **Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.**

In [3]:
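data = {
    'Values': [1, 2, 3, 4, 5, 6, 6, 7, 8, 9]
}

df = pd.DataFrame(data)

# Sum of the entire column, accumulated by iterating over it.
# The accumulator is named `total` to avoid shadowing the built-in sum().
total = 0
for value in df['Values']:
    total += value
print(total)

def print_sum_of_first_three(df):
    # Iterate over the first three values (selected by position) and add them up
    sum_of_first_three = 0
    for value in df['Values'].iloc[:3]:
        sum_of_first_three += value
    print(sum_of_first_three)

print_sum_of_first_three(df)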

51
6
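
For comparison, the same result can be obtained without an explicit loop. This is a minimal sketch, assuming a vectorized answer is acceptable even though the question asks for iteration:

```python
import pandas as pd

df = pd.DataFrame({'Values': [1, 2, 3, 4, 5, 6, 6, 7, 8, 9]})

# .iloc[:3] selects the first three rows by position, whatever the index labels are
sum_of_first_three = df['Values'].iloc[:3].sum()
print(sum_of_first_three)
```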


### **Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.**

In [4]:
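def add_word_count(df):
    # Split each text on whitespace and count the resulting words
    df['Word_Count'] = df['Text'].str.split().str.len()
    return df

data = {
    'Text': ['This is Pandas Assignment 1.', 'There are more', 'But i am not done yet ', 'mura mura mura mura mura ']
}

df = pd.DataFrame(data)

df = add_word_count(df)

df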

                           Text  Word_Count
0  This is Pandas Assignment 1.           5
1                There are more           3
2         But i am not done yet           6
3      mura mura mura mura mura           5


### **Q5. How are DataFrame.size() and DataFrame.shape() different?**

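`DataFrame.size` and `DataFrame.shape` are attributes, not methods, so they are accessed without parentheses (calling `df.size()` or `df.shape()` raises a `TypeError`). They provide different pieces of information about the DataFrame: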

1. **`DataFrame.size`**:
   - **Description**: This attribute returns the number of elements in the DataFrame.
   - **Calculation**: It is calculated as the product of the number of rows and the number of columns.
   - **Example**: For a DataFrame with 10 rows and 5 columns, `DataFrame.size` would be \(10 \times 5 = 50\).

2. **`DataFrame.shape`**:
   - **Description**: This attribute returns a tuple representing the dimensionality of the DataFrame.
   - **Content**: The tuple contains two values: the number of rows and the number of columns.
   - **Example**: For a DataFrame with 10 rows and 5 columns, `DataFrame.shape` would be `(10, 5)`.

Here's an example to illustrate the difference:

```python
import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Using DataFrame.size
print("DataFrame size:", df.size)  # Output: 9 (3 rows * 3 columns)

# Using DataFrame.shape
print("DataFrame shape:", df.shape)  # Output: (3, 3) (3 rows, 3 columns)
```

Explanation:
- **`df.size`**: Returns 9 because there are 3 rows and 3 columns, resulting in 9 elements in total.
- **`df.shape`**: Returns (3, 3) indicating that the DataFrame has 3 rows and 3 columns.

In summary:
- `DataFrame.size` gives the total number of elements (rows × columns).
- `DataFrame.shape` provides the dimensions of the DataFrame (number of rows, number of columns).
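
The 10-row, 5-column case used in the descriptions above can be checked directly. This is a small sketch; the column labels and the fill value of 0 are arbitrary choices made for illustration.

```python
import pandas as pd

# 10 rows and 5 columns, every cell filled with 0
df = pd.DataFrame(0, index=range(10), columns=list('ABCDE'))

rows, cols = df.shape  # shape unpacks into (number of rows, number of columns)

print(df.shape)
print(df.size)
print(rows * cols == df.size)
```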

### **Q6. Which function of pandas do we use to read an excel file?**

We use `pandas.read_excel()`:

```python
import pandas as pd

# Read the sheet named 'Sheet2', restricted to Excel columns A through C.
# header=1 treats the second row as the header (so the first row is skipped),
# and names=[...] replaces that header with custom column names.
df = pd.read_excel('path_to_your_file.xlsx', sheet_name='Sheet2', header=1, names=['A', 'B', 'C'], usecols="A:C")

# Display the DataFrame
print(df)
```

### **Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.**

In [11]:
def add_username_column(df):
    # Everything before the first '@' is the username.
    # Note: this sample data uses a lowercase 'email' column name.
    df['Username'] = df['email'].str.split('@').str[0]
    return df

data = {
    'email': ["karanparmar00014@gmail.com", "karanparmar00015@gmail.com", "karanparmar00016@gmail.com", "karanparmar00017@gmail.com", "karanparmar00018@gmail.com", "karanparmar00019@", "karanparmar000110@gmail.com"]
}

df = pd.DataFrame(data)

print("Original DataFrame:")

print(df)

df = add_username_column(df)

df

Original DataFrame:
                         email
0   karanparmar00014@gmail.com
1   karanparmar00015@gmail.com
2   karanparmar00016@gmail.com
3   karanparmar00017@gmail.com
4   karanparmar00018@gmail.com
5            karanparmar00019@
6  karanparmar000110@gmail.com


                         email           Username
0   karanparmar00014@gmail.com   karanparmar00014
1   karanparmar00015@gmail.com   karanparmar00015
2   karanparmar00016@gmail.com   karanparmar00016
3   karanparmar00017@gmail.com   karanparmar00017
4   karanparmar00018@gmail.com   karanparmar00018
5            karanparmar00019@   karanparmar00019
6  karanparmar000110@gmail.com  karanparmar000110


### **Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows. For example, if df contains the following values:**
A B C
0 3 5 1
1 8 2 7
2 6 9 4
3 2 3 5
4 9 1 2

In [22]:
data = {
    'A': [3, 8, 6, 2, 9],
    'B': [5, 2, 9, 3, 1],
    'C': [1, 7, 4, 5, 2]
}

df = pd.DataFrame(data)

print(df)
print("\n \n")

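def select_rows(df):
    # Keep rows where A > 5 and B < 10; .copy() makes the result an independent DataFrame
    new_df = df[(df['A'] > 5) & (df['B'] < 10)].copy()
    return new_df

print(select_rows(df))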



   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

 

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
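
The same selection can also be written with `DataFrame.query`, which takes the condition as a string. This is offered as an alternative sketch rather than a replacement for the boolean-mask approach; the variable name `selected` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [3, 8, 6, 2, 9],
    'B': [5, 2, 9, 3, 1],
    'C': [1, 7, 4, 5, 2]
})

# Column names can be used directly inside the query expression
selected = df.query('A > 5 and B < 10')
print(selected)
```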


### **Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.**

In [23]:
import pandas as pd

def calculate_statistics(df):
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_deviation = df['Values'].std()

    return mean_value, median_value, std_deviation


data = {
    'Values': [10, 20, 30, 40, 50]
}

df = pd.DataFrame(data)
mean, median, std_dev = calculate_statistics(df)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev}")


Mean: 30.0
Median: 30.0
Standard Deviation: 15.811388300841896
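
`Series.std()` returns the sample standard deviation (`ddof=1`) by default, which is why the value above is about 15.81 and not 14.14. If the population standard deviation is wanted instead, `ddof=0` can be passed. A short sketch on the same five values:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

sample_std = values.std()            # default ddof=1: divides by n - 1
population_std = values.std(ddof=0)  # ddof=0: divides by n

print(sample_std)
print(population_std)
```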


### **Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.**

In [24]:
import pandas as pd

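def calculate_moving_average(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'])

    # Sort chronologically so each window covers the preceding rows in time order
    df = df.sort_values(by='Date')

    # window=7 spans the current row plus the 6 rows before it (one row per day is assumed);
    # min_periods=1 yields an average even when fewer than 7 rows are available
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()

    return df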

data = {
    'Date': ['2024-05-01', '2024-05-02', '2024-05-03', '2024-05-04', '2024-06-05', '2024-06-06', '2024-06-07', '2024-06-08'],
    'Sales': [100, 150, 200, 250, 300, 350, 400, 450]
}

df = pd.DataFrame(data)
df = calculate_moving_average(df)
print(df)


        Date  Sales  MovingAverage
0 2024-05-01    100          100.0
1 2024-05-02    150          125.0
2 2024-05-03    200          150.0
3 2024-05-04    250          175.0
4 2024-06-05    300          200.0
5 2024-06-06    350          225.0
6 2024-06-07    400          250.0
7 2024-06-08    450          300.0
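
`rolling(window=7)` counts rows, not calendar days, so it matches "the past 7 days" only when there is exactly one row per day. The sample data jumps from 2024-05-04 to 2024-06-05, so the June averages above still include sales from May. If calendar-day windows are what is wanted, a time-based window such as `'7D'` is one option. The sketch below makes that assumption; `calculate_moving_average_7d` is a hypothetical name.

```python
import pandas as pd

def calculate_moving_average_7d(df):
    # Hypothetical variant: the window covers 7 calendar days, not 7 rows
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.sort_values(by='Date')  # time-based windows need sorted dates

    # '7D' keeps only rows whose date lies in the 7 days ending at the current row
    df['MovingAverage'] = df.rolling('7D', on='Date')['Sales'].mean()
    return df

data = {
    'Date': ['2024-05-01', '2024-05-02', '2024-05-03', '2024-05-04',
             '2024-06-05', '2024-06-06', '2024-06-07', '2024-06-08'],
    'Sales': [100, 150, 200, 250, 300, 350, 400, 450]
}

result = calculate_moving_average_7d(pd.DataFrame(data))
print(result)
```

With this variant the average restarts on 2024-06-05, because no earlier row falls within the preceding 7 days.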


### **Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column. For example, if df contains the following values:**
Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05

Your function should create the following DataFrame:

Date Weekday
0 2023-01-01 Sunday
1 2023-01-02 Monday
2 2023-01-03 Tuesday
3 2023-01-04 Wednesday
4 2023-01-05 Thursday

The function should return the modified DataFrame.

In [28]:
import pandas as pd

def add_weekday_column(df):
    df['Date'] = pd.to_datetime(df['Date'])
    
    df['Weekday'] = df['Date'].dt.day_name()
    
    return df

data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
}

df = pd.DataFrame(data)
df = add_weekday_column(df)
print(df)
# Use the dates as the index; the bare expression displays the result in the notebook
data_index = df.set_index('Date')
data_index

        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday


              Weekday
Date                 
2023-01-01     Sunday
2023-01-02     Monday
2023-01-03    Tuesday
2023-01-04  Wednesday
2023-01-05   Thursday


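### **Q12. Given a Pandas DataFrame df with a column 'Date', write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.**

In [29]: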
import pandas as pd

def select_date_range(df):
    df['Date'] = pd.to_datetime(df['Date'])
    
    start_date = '2023-01-01'
    end_date = '2023-01-31'
    
    mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
    filtered_df = df.loc[mask]
    
    return filtered_df

data = {
    'Date': ['2023-01-01', '2023-01-15', '2023-02-01', '2022-12-31', '2023-01-31'],
    'Value': [10, 20, 30, 40, 50]
}

df = pd.DataFrame(data)
filtered_df = select_date_range(df)
print(filtered_df)


        Date  Value
0 2023-01-01     10
1 2023-01-15     20
4 2023-01-31     50
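
The two-sided comparison in the mask can also be expressed with `Series.between`, which includes both endpoints by default. This is a compact sketch on the same sample data; the variable name `filtered` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-15', '2023-02-01', '2022-12-31', '2023-01-31'],
    'Value': [10, 20, 30, 40, 50]
})
df['Date'] = pd.to_datetime(df['Date'])

# between() is inclusive on both ends by default
filtered = df[df['Date'].between('2023-01-01', '2023-01-31')]
print(filtered)
```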


### **Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?**

To use the basic functions of pandas, the first and foremost necessary library that needs to be imported is pandas itself.
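
By convention the library is imported under the alias `pd`, as in the cells above. A minimal check that the import works:

```python
import pandas as pd  # `pd` is the conventional alias

# Print the installed version and build a small Series to confirm pandas is usable
print(pd.__version__)
series = pd.Series([1, 2, 3])
print(series.sum())
```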

# **Complete**