<a href="https://colab.research.google.com/github/ranamaddy/Introduction-to-Coding-in-Pandas-Using-Python/blob/main/Lesson_2_Data_Manipulation_with_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 2: Data Manipulation with Pandas
- Accessing and manipulating DataFrame columns and rows
- Filtering and selecting data from a DataFrame
- Handling missing data (NaN values)
- Data sorting and ranking
- Applying mathematical and statistical operations to data
- Working with dates and times in Pandas


In this lesson, we will explore various techniques for manipulating data using Pandas. Data manipulation is a crucial step in data analysis as it allows us to transform and reshape the data to extract meaningful insights. We will cover selecting and filtering data, handling missing values, sorting and ranking, applying mathematical and statistical operations, and working with dates and times.

**By the end of this lesson, you will be able to:**

- Select specific data from a DataFrame using indexing and slicing techniques.
- Filter data based on certain conditions using boolean indexing.
- Handle missing data by identifying and dealing with null or NaN values.
- Sort and rank data based on column values.
- Apply mathematical and statistical operations to columns.
- Work with date and time columns.

Throughout the lesson, we will work on practical examples and exercises to solidify your understanding of data manipulation techniques in Pandas. These skills are essential for effectively working with data and gaining meaningful insights.

Now, let's dive into the first topic: data selection.

# Accessing and manipulating DataFrame columns and rows
Accessing and manipulating DataFrame columns and rows is a fundamental aspect of data manipulation in Pandas. In this section, we will learn various techniques to select and modify specific columns and rows in a DataFrame.

1. **Accessing Columns**:

- To access a specific column in a DataFrame, you can use square bracket notation or the dot notation.
- Example

In [3]:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Country': ['USA', 'UK', 'Canada']})

# Square bracket notation
name_col = df['Name']

# Dot notation
age_col = df.Age

print(name_col)
print(age_col)
print("\nDataframe \n")
df


0    John
1    Emma
2    Mike
Name: Name, dtype: object
0    25
1    28
2    30
Name: Age, dtype: int64

Dataframe 



|   | Name | Age | Country |
|---|------|-----|---------|
| 0 | John | 25 | USA |
| 1 | Emma | 28 | UK |
| 2 | Mike | 30 | Canada |


In the example above, we accessed the 'Name' column using square bracket notation **(df['Name'])** and the 'Age' column using dot notation **(df.Age)**, and assigned them to variables. Either notation works for either column, and both return the column as a Series.
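Dot notation is convenient, but it does not work for every column. The sketch below uses a hypothetical DataFrame (the column names 'First Name' and 'count' are made up for illustration) to show two cases where square brackets are the safer choice: a name containing a space, and a name that clashes with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical columns chosen to illustrate the limits of dot notation
people = pd.DataFrame({'First Name': ['John', 'Emma'],
                       'count': [3, 7]})

# A column name containing a space can only be reached with square brackets
first_names = people['First Name']

# 'count' is also the name of a DataFrame method, so people.count
# refers to the method; square brackets return the column
count_col = people['count']
dot_result = people.count

print(first_names.tolist())
print(count_col.tolist())
print(callable(dot_result))  # the method, not the column
```

For this reason, square bracket notation is usually the more robust habit, especially for column names read from external files.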

2. **Accessing Rows:**

- To access specific rows in a DataFrame, you can use the loc or iloc attribute, which allow indexing by labels or integer positions, respectively.
- Example:

In [4]:
# Using loc attribute
row_0 = df.loc[0]  # Access the first row

# Using iloc attribute
row_1 = df.iloc[1]  # Access the second row

print(row_0)
print(row_1)


Name       John
Age          25
Country     USA
Name: 0, dtype: object
Name       Emma
Age          28
Country      UK
Name: 1, dtype: object


In the example above, we accessed specific rows using the loc and iloc attributes. **loc** selects by index label (here the label 0), while **iloc** selects by integer position (here position 1, the second row). Each call returns the entire row as a Series object.
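With the default index, labels and positions are the same numbers, so loc and iloc can look interchangeable. The following sketch assumes a hypothetical DataFrame with string index labels ('a', 'b', 'c') to make the difference visible.

```python
import pandas as pd

# Hypothetical string index labels, used only for illustration
staff = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                      'Age': [25, 28, 30]},
                     index=['a', 'b', 'c'])

by_label = staff.loc['b']     # row whose index label is 'b'
by_position = staff.iloc[1]   # second row, whatever its label is

# Slices differ too: loc includes the end label,
# iloc excludes the end position
label_slice = staff.loc['a':'b']
position_slice = staff.iloc[0:1]

print(by_label['Name'], by_position['Name'])
print(len(label_slice), len(position_slice))
```

Here both single-row lookups return Emma's row, while the two slices return two rows and one row respectively.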

3. **Modifying Columns and Rows:**

- You can modify existing columns or add new columns to a DataFrame.
- Example:

In [5]:
# Modifying a column
df['Age'] = df['Age'] + 1  # Increment each age value by 1

# Adding a new column
df['Profession'] = ['Engineer', 'Teacher', 'Doctor']  # Assign a list of values to a new column

print(df)


   Name  Age Country Profession
0  John   26     USA   Engineer
1  Emma   29      UK    Teacher
2  Mike   31  Canada     Doctor


**In the example** above, we modified the 'Age' column by incrementing each value by 1. We also added a new 'Profession' column by assigning a list of values. The DataFrame reflects these modifications.

These techniques provide you with the ability to access and manipulate specific columns and rows in a DataFrame. This level of control allows you to extract relevant data, perform calculations, and modify the structure of the DataFrame to suit your analysis needs.

**There are additional techniques and examples related to accessing and manipulating DataFrame columns and rows. Here are a few more:**

1. **Accessing Multiple Columns:**

- To access multiple columns in a DataFrame, you can pass a list of column names within the square brackets.
- Example:

In [6]:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Country': ['USA', 'UK', 'Canada']})

selected_cols = df[['Name', 'Country']]

print(selected_cols)


   Name Country
0  John     USA
1  Emma      UK
2  Mike  Canada


**In the example above**, we accessed the 'Name' and 'Country' columns by passing a list of column names ['Name', 'Country'] within the square brackets.

2. **Modifying Rows:**

- You can modify specific rows in a DataFrame by assigning new values to them.
- Example:

In [7]:
# Modify a specific row by index
df.loc[2] = ['Michael', 31, 'USA']

print(df)


      Name  Age Country
0     John   25     USA
1     Emma   28      UK
2  Michael   31     USA


**In the example above**, we modified the values in the third row (index 2) by assigning a list of new values ['Michael', 31, 'USA'] to that row. The list must contain one value for each column, in column order.

These additional techniques expand your understanding of accessing and manipulating DataFrame columns and rows. They provide you with more flexibility in working with specific subsets of data and making targeted modifications to the DataFrame.
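When only one value needs to change, replacing the whole row is more than necessary. As a minimal sketch, loc also accepts a row selector and a column name together; the replacement values 'Ireland' and 'Unknown' below are arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Country': ['USA', 'UK', 'Canada']})

# Change a single cell: row label 1, column 'Country'
df.loc[1, 'Country'] = 'Ireland'

# Change one column for every row that satisfies a condition
df.loc[df['Age'] >= 30, 'Country'] = 'Unknown'

print(df)
```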

# Filtering and selecting data from a DataFrame

Filtering and selecting data from a DataFrame is an essential skill in data analysis. It allows you to extract specific subsets of data based on certain conditions or criteria. Let's explore how to filter and select data from a DataFrame:

1. **Filtering Rows based on Conditions:**

- To filter rows based on specific conditions, you can use boolean indexing. This involves creating a boolean condition that evaluates to True or False for each row in the DataFrame.
- Example:

In [8]:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Country': ['USA', 'UK', 'Canada']})

# Filter rows where Age is greater than 25
filtered_rows = df[df['Age'] > 25]

print(filtered_rows)


   Name  Age Country
1  Emma   28      UK
2  Mike   30  Canada


**Explanation**: In this example, we filtered the rows where the 'Age' column value is greater than 25. The condition **df['Age'] > 25** creates a boolean Series with **True** for rows that satisfy the condition and **False** for rows that don't. By passing this boolean Series inside the square brackets of the DataFrame **df**, we select only the rows where the condition is **True**.
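To see what the explanation describes, it can help to look at the boolean Series on its own before using it as a filter. This is a small sketch using the same DataFrame; the variable name mask is an assumption made for readability.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Country': ['USA', 'UK', 'Canada']})

# The condition on its own is a boolean Series, one value per row
mask = df['Age'] > 25
print(mask.tolist())

# Summing a boolean Series counts the True values (matching rows)
print(mask.sum())

# Indexing with the mask keeps only the rows marked True
print(df[mask]['Name'].tolist())
```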

2. **Selecting Columns:**

- To select specific columns from a DataFrame, you can use square bracket notation, passing a list of column names you want to select.
- Example:

In [11]:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Country': ['USA', 'UK', 'Canada']})

# Selecting specific columns
selected_cols = df[['Name', 'Country']]

print(selected_cols)
df


   Name Country
0  John     USA
1  Emma      UK
2  Mike  Canada


|   | Name | Age | Country |
|---|------|-----|---------|
| 0 | John | 25 | USA |
| 1 | Emma | 28 | UK |
| 2 | Mike | 30 | Canada |


**Explanation**: In this example, we selected the 'Name' and 'Country' columns by passing a list of column names **['Name', 'Country']** inside the square brackets. The resulting DataFrame contains only the selected columns.

3. **Combining Filters:**

- You can combine multiple conditions using logical operators **like & (and), | (or), and ~ (not)** to create complex filters.
- Example:

In [12]:
# Filter rows where Age is greater than 20 and Country is 'USA'
filtered_rows = df[(df['Age'] > 20) & (df['Country'] == 'USA')]

print(filtered_rows)


   Name  Age Country
0  John   25     USA


**Explanation**: In this example, we filtered the rows where the 'Age' column value is greater than 20 and the 'Country' column value is 'USA'. We used the logical operator & to combine the conditions, each wrapped in its own parentheses.
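The other two logical operators work the same way. The sketch below applies | (or) and ~ (not) to the same DataFrame; the specific conditions are chosen only for illustration. Each comparison needs its own parentheses because & and | bind more tightly than comparison operators such as > and ==.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Country': ['USA', 'UK', 'Canada']})

# OR: rows where Age is below 26 or Country is 'Canada'
either = df[(df['Age'] < 26) | (df['Country'] == 'Canada')]

# NOT: rows where Country is anything other than 'USA'
not_usa = df[~(df['Country'] == 'USA')]

print(either['Name'].tolist())
print(not_usa['Name'].tolist())
```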


By mastering these techniques, you can effectively filter and select specific subsets of data from a DataFrame. This enables you to focus on the relevant data for analysis and gain insights from the specific criteria or conditions you define.

**Here are five more examples** of filtering and selecting data from a DataFrame using various operators and methods:

1. Using the **isin()** method to select rows where a column's value is in a list. The first call below matches no names and returns an empty DataFrame; the second matches 'John':

In [14]:
df[df['Name'].isin(['pakistan', 'India'])]


|   | Name | Age | Country |
|---|------|-----|---------|


In [16]:
df[df['Name'].isin(['John', 'India'])]

|   | Name | Age | Country |
|---|------|-----|---------|
| 0 | John | 25 | USA |


2. Using the **!=** (not equal) operator to select rows where a column's value is not equal to a certain value:

In [19]:
df[df['Name'] != 'John']


|   | Name | Age | Country |
|---|------|-----|---------|
| 1 | Emma | 28 | UK |
| 2 | Mike | 30 | Canada |


3. Using the **between()** method to select rows where a column's value is between two values (both bounds are included by default):

In [21]:
df[df['Age'].between(10, 29)]


|   | Name | Age | Country |
|---|------|-----|---------|
| 0 | John | 25 | USA |
| 1 | Emma | 28 | UK |


4. Using the **str.startswith()** method to select rows where a column's value starts with a certain string:

In [23]:
df[df['Name'].str.startswith('J')]


|   | Name | Age | Country |
|---|------|-----|---------|
| 0 | John | 25 | USA |


5. Using the "query" method to select rows based on a conditional expression:

In [24]:
df.query('Age > 3 and Name == "John"')


|   | Name | Age | Country |
|---|------|-----|---------|
| 0 | John | 25 | USA |
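A query string can also refer to ordinary Python variables by prefixing them with @. The following sketch assumes a hypothetical threshold variable named min_age and checks that the result matches the equivalent boolean-indexing filter.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Country': ['USA', 'UK', 'Canada']})

min_age = 26  # hypothetical threshold, defined outside the query string

# @min_age inserts the value of the Python variable into the expression
older = df.query('Age > @min_age')

# The same filter written with boolean indexing
same_rows = df[df['Age'] > min_age]

print(older['Name'].tolist())
print(older.equals(same_rows))
```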


# Handling missing data (NaN values)


Handling missing data, represented as NaN (Not a Number) values, is an important task in data analysis. NaN values can occur when data is incomplete, unavailable, or improperly recorded. It is essential to understand how to identify and deal with missing data effectively. Here are some techniques for handling missing data in a DataFrame:

1. **Identifying Missing Data:**

- To identify missing data in a DataFrame, you can use the isnull() or isna() methods. These methods return a DataFrame of the same shape as the original, where each cell is either True if it contains a NaN value or False if it contains a valid value.
- Example:

In [27]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

print(df)
# Identifying missing data
missing_data = df.isnull()

print(missing_data)


     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  7.0
3  4.0  8.0
       A      B
0  False  False
1  False   True
2   True  False
3  False  False


**Explanation**: In this example, we used the isnull() method to identify missing data in the DataFrame df. The resulting DataFrame, missing_data, contains True values where there are NaN values and False values where the data is present.
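On larger DataFrames a full True/False table is hard to read, so it is common to summarise it. As a short sketch on the same data, the code below counts the missing values per column and lists only the rows that contain at least one NaN; the variable names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# True counts as 1, so summing gives the number of NaN values per column
missing_per_column = df.isnull().sum()

# any(axis=1) is True for a row if any of its cells is NaN
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```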

2. **Handling Missing Data:**

- Once missing data is identified, there are several ways to handle it. Common techniques include:
  - Dropping rows or columns: You can use the dropna() method to remove rows or columns that contain any NaN values.
  - Filling missing values: You can use the fillna() method to replace NaN values with specific values, such as the mean, median, or a custom value.
- Example:

In [28]:
# Dropping rows with missing data
df_dropped = df.dropna()

# Filling missing values with the mean
df_filled = df.fillna(df.mean())

print(df_dropped)
print(df_filled)


     A    B
0  1.0  5.0
3  4.0  8.0
          A         B
0  1.000000  5.000000
1  2.000000  6.666667
2  2.333333  7.000000
3  4.000000  8.000000


**Explanation**: In this example, we dropped the rows with missing data using the **dropna()** method, resulting in the DataFrame **df_dropped**. We also filled the missing values with the mean of each column using the **fillna()** method, resulting in the DataFrame **df_filled**.
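The list above also mentions dropping columns and filling with a custom value, which the example does not show. Here is a minimal sketch of those variants; the extra complete column 'C' is added only so that something remains after dropping columns, and the fill value 0 is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8],
                   'C': [9, 10, 11, 12]})  # complete column, added for illustration

# Drop columns (instead of rows) that contain any NaN
cols_dropped = df.dropna(axis=1)

# Drop rows only when a particular column is missing
rows_dropped = df.dropna(subset=['A'])

# Replace every NaN with a fixed custom value
zero_filled = df.fillna(0)

print(cols_dropped.columns.tolist())
print(rows_dropped.index.tolist())
print(zero_filled)
```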

**Handling missing data is crucial to ensure the accuracy and reliability of your data analysis. By identifying and addressing missing data appropriately, you can avoid biased or incorrect results.**

**Let's consider an example with a dataset to demonstrate handling missing data using pandas.**

Suppose we have a dataset that contains information about students' test scores and their corresponding ages. However, some data points are missing. Here's how you can handle missing data in this scenario:

In [29]:
import pandas as pd

# Create a DataFrame with missing data
data = {'Name': ['John', 'Emma', 'Mike', 'Sarah'],
        'Age': [25, None, 30, 27],
        'Score': [80, 92, None, 75]}

df = pd.DataFrame(data)

# Identifying missing data
missing_data = df.isnull()
print("Missing Data:\n", missing_data)

# Dropping rows with missing data
df_dropped = df.dropna()
print("\nDropped Rows with Missing Data:\n", df_dropped)

# Filling missing values with the mean of each numeric column
# (numeric_only=True skips the text column 'Name')
df_filled = df.fillna(df.mean(numeric_only=True))
print("\nFilled Missing Values:\n", df_filled)


Missing Data:
     Name    Age  Score
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False

Dropped Rows with Missing Data:
     Name   Age  Score
0   John  25.0   80.0
3  Sarah  27.0   75.0

Filled Missing Values:
     Name        Age      Score
0   John  25.000000  80.000000
1   Emma  27.333333  92.000000
2   Mike  30.000000  82.333333
3  Sarah  27.000000  75.000000



**Explanation**:

- In the given example, we created a DataFrame df with missing data. Some entries in the '**Age**' and '**Score**' columns are set to **None**.
- Using the **isnull()** method, we identified the missing data and obtained a DataFrame **missing_data** where True represents missing values.
- Next, we used the **dropna()** method to remove rows with missing data, resulting in the DataFrame df_dropped.
- To fill the missing values, we used the **fillna()** method with the mean of each column from **df**. The resulting DataFrame is **df_filled**.
- As a result, **df_dropped** contains only the rows with complete data (John and Sarah), and **df_filled** replaces the missing values with the column means.

By dropping or filling missing data appropriately, you can handle missing values in a dataset and proceed with your analysis accurately.
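Different columns often call for different fill strategies. As a sketch under that assumption, fillna() accepts a dictionary that maps column names to fill values; using the median for 'Age' and 0 for 'Score' here is purely illustrative, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike', 'Sarah'],
                   'Age': [25, None, 30, 27],
                   'Score': [80, 92, None, 75]})

# Each key is a column name, each value is what replaces NaN in that column
fill_values = {'Age': df['Age'].median(), 'Score': 0}
df_custom = df.fillna(fill_values)

print(df_custom)
```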

In [30]:
import pandas as pd

df = pd.read_csv('data.csv')
df

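|   | Student ID | Class | Study hrs | Sleeping hrs | Social Media usage hrs | Mobile Games hrs | Percantege |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | 10 | 2.0 | 9 | 3.0 | 5.0 | 50 |
| 1 | 1002 | 10 | 6.0 | 8 | 2.0 | 0.0 | 80 |
| 2 | 1003 | 10 | 3.0 | 8 | 2.0 | NaN | 60 |
| 3 | 1004 | 11 | 0.0 | 10 | 1.0 | 5.0 | 45 |
| 4 | 1005 | 11 | 4.0 | 7 | NaN | 0.0 | 75 |
| 5 | 1006 | 11 | NaN | 7 | 0.0 | 0.0 | 96 |
| 6 | 1007 | 12 | 4.0 | 6 | 0.0 | 0.0 | 80 |
| 7 | 1008 | 12 | 10.0 | 6 | 2.0 | 0.0 | 90 |
| 8 | 1009 | 12 | 2.0 | 8 | 2.0 | 4.0 | 60 |
| 9 | 1010 | 12 | 6.0 | 9 | 1.0 | 0.0 | 85 |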


In [33]:
# Identifying missing data
missing_data = df.isnull()
print("Missing Data:\n")
missing_data

Missing Data:



|   | Student ID | Class | Study hrs | Sleeping hrs | Social Media usage hrs | Mobile Games hrs | Percantege |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False |
| 4 | False | False | False | False | True | False | False |
| 5 | False | False | True | False | False | False | False |
| 6 | False | False | False | False | False | False | False |
| 7 | False | False | False | False | False | False | False |
| 8 | False | False | False | False | False | False | False |
| 9 | False | False | False | False | False | False | False |


In [34]:
# Dropping rows with missing data
df_dropped = df.dropna()
print("\nDropped Rows with Missing Data:\n")
df_dropped


Dropped Rows with Missing Data:



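|   | Student ID | Class | Study hrs | Sleeping hrs | Social Media usage hrs | Mobile Games hrs | Percantege |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | 10 | 2.0 | 9 | 3.0 | 5.0 | 50 |
| 1 | 1002 | 10 | 6.0 | 8 | 2.0 | 0.0 | 80 |
| 3 | 1004 | 11 | 0.0 | 10 | 1.0 | 5.0 | 45 |
| 6 | 1007 | 12 | 4.0 | 6 | 0.0 | 0.0 | 80 |
| 7 | 1008 | 12 | 10.0 | 6 | 2.0 | 0.0 | 90 |
| 8 | 1009 | 12 | 2.0 | 8 | 2.0 | 4.0 | 60 |
| 9 | 1010 | 12 | 6.0 | 9 | 1.0 | 0.0 | 85 |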


In [35]:
# Filling missing values with mean
df_filled = df.fillna(df.mean())
print("\nFilled Missing Values:\n")

df_filled


Filled Missing Values:



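|   | Student ID | Class | Study hrs | Sleeping hrs | Social Media usage hrs | Mobile Games hrs | Percantege |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | 10 | 2.0 | 9 | 3.0 | 5.0 | 50 |
| 1 | 1002 | 10 | 6.0 | 8 | 2.0 | 0.0 | 80 |
| 2 | 1003 | 10 | 3.0 | 8 | 2.0 | 1.555556 | 60 |
| 3 | 1004 | 11 | 0.0 | 10 | 1.0 | 5.0 | 45 |
| 4 | 1005 | 11 | 4.0 | 7 | 1.444444 | 0.0 | 75 |
| 5 | 1006 | 11 | 4.111111 | 7 | 0.0 | 0.0 | 96 |
| 6 | 1007 | 12 | 4.0 | 6 | 0.0 | 0.0 | 80 |
| 7 | 1008 | 12 | 10.0 | 6 | 2.0 | 0.0 | 90 |
| 8 | 1009 | 12 | 2.0 | 8 | 2.0 | 4.0 | 60 |
| 9 | 1010 | 12 | 6.0 | 9 | 1.0 | 0.0 | 85 |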


# Data sorting and ranking

Data sorting and ranking are essential operations in data analysis that allow you to organize and order your data based on specific criteria. Let's explore these concepts:

1. **Data Sorting**:

- Sorting data refers to arranging the rows of a DataFrame or Series in a specific order based on one or more columns.
- You can sort data in either ascending (smallest to largest) or descending (largest to smallest) order.
- Example:

In [40]:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike'],
                   'Age': [25, 28, 30],
                   'Score': [80, 92, 75]})

print('\n before sorting\n',df)

# Sorting by a single column (ascending by default)
sorted_df = df.sort_values('Score')

print("\nafter sorting\n")
print(sorted_df)



 before sorting
    Name  Age  Score
0  John   25     80
1  Emma   28     92
2  Mike   30     75

after sorting

   Name  Age  Score
2  Mike   30     75
0  John   25     80
1  Emma   28     92


**Explanation**: In this example, we sorted the DataFrame **df** based on the 'Score' column using the **sort_values()** method. The resulting DataFrame **sorted_df** is arranged in ascending order of 'Score' (the default), with the lowest score appearing first. Each row keeps its original index label, which is why the index now reads 2, 0, 1.
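The text above mentions descending order and sorting by more than one column, which the example does not demonstrate. The sketch below covers both; the extra 'Sarah' row is hypothetical and is added so that the second sort key has a tie to break.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike', 'Sarah'],
                   'Age': [25, 28, 30, 25],
                   'Score': [80, 92, 75, 92]})  # 'Sarah' row added for illustration

# Descending order: highest score first
highest_first = df.sort_values('Score', ascending=False)

# Sort by Age ascending, then break ties by Score descending
by_age_then_score = df.sort_values(['Age', 'Score'], ascending=[True, False])

print(highest_first)
print(by_age_then_score)
```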

2. **Data Ranking**:

- Ranking data involves assigning a rank to each value in a column based on their order.
- The rank indicates the position of a value relative to others, considering the sorting order and handling ties appropriately.
- Example:

In [41]:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Mike', 'Sarah'],
                   'Score': [80, 92, 75, 92]})

# Ranking the 'Score' column
df['Rank'] = df['Score'].rank(ascending=False, method='average')

print(df)


    Name  Score  Rank
0   John     80   3.0
1   Emma     92   1.5
2   Mike     75   4.0
3  Sarah     92   1.5


**Explanation**: In this example, we ranked the 'Score' column using the **rank()** method. The **ascending=False** parameter gives the highest score rank 1. The **method='average'** parameter handles ties by assigning each tied value the average of the ranks they would occupy: Emma and Sarah share ranks 1 and 2, so both receive 1.5. The resulting DataFrame contains a new 'Rank' column holding the rank of each score.
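The method parameter accepts other tie-handling strategies as well. As an illustrative sketch on the same scores, the code below compares 'average' with 'min', 'max', and 'dense' side by side; the Series and DataFrame names are assumptions.

```python
import pandas as pd

scores = pd.Series([80, 92, 75, 92],
                   index=['John', 'Emma', 'Mike', 'Sarah'])

ranks = pd.DataFrame({
    'Score': scores,
    # tied values share the mean of their ranks
    'average': scores.rank(ascending=False, method='average'),
    # tied values all get the lowest rank in the group
    'min': scores.rank(ascending=False, method='min'),
    # tied values all get the highest rank in the group
    'max': scores.rank(ascending=False, method='max'),
    # like 'min', but ranks increase by 1 between groups (no gaps)
    'dense': scores.rank(ascending=False, method='dense'),
})

print(ranks)
```

With 'min', the two top scores both get rank 1 and the next score gets rank 3; with 'dense', the next score gets rank 2.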

Sorting and ranking data help you organize and understand your data better. By sorting data, you can identify patterns, outliers, or arrange it for further analysis. Ranking provides a comparative measure of values, allowing you to identify the relative positions of individual entries. These operations are valuable tools in data analysis and visualization.

# Applying mathematical and statistical operations to data

Applying mathematical and statistical operations to data is a fundamental aspect of data analysis. These operations allow you to gain insights, summarize data, and make informed decisions. Let's explore some common mathematical and statistical operations:

1. **Mathematical Operations**:

- Mathematical operations involve performing calculations on numerical data, such as addition, subtraction, multiplication, and division.
- You can apply these operations to individual columns or combine multiple columns to create new computed columns.
- Example:

In [42]:
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15],
                   'B': [2, 4, 6]})

# Addition
df['C'] = df['A'] + df['B']

# Multiplication
df['D'] = df['A'] * df['B']

print(df)


    A  B   C   D
0   5  2   7  10
1  10  4  14  40
2  15  6  21  90


**Explanation**: In this example, we performed addition and multiplication operations on the columns '**A**' and '**B**' to create new computed columns '**C**' and '**D**'. The resulting DataFrame df contains the computed values.
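Subtraction and division, also mentioned above, follow the same pattern. This short sketch uses the same columns; the new column names 'Diff', 'Ratio', and 'A_doubled' are made up for illustration. It also shows that a single number is applied to every row.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15],
                   'B': [2, 4, 6]})

# Subtraction and division work element by element, row by row
df['Diff'] = df['A'] - df['B']
df['Ratio'] = df['A'] / df['B']

# A single number is applied to every value in the column
df['A_doubled'] = df['A'] * 2

print(df)
```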

**Example with dataset**

In [43]:
import pandas as pd

df = pd.read_csv('data.csv')
df

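|   | Student ID | Class | Study hrs | Sleeping hrs | Social Media usage hrs | Mobile Games hrs | Percantege |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | 10 | 2.0 | 9 | 3.0 | 5.0 | 50 |
| 1 | 1002 | 10 | 6.0 | 8 | 2.0 | 0.0 | 80 |
| 2 | 1003 | 10 | 3.0 | 8 | 2.0 | NaN | 60 |
| 3 | 1004 | 11 | 0.0 | 10 | 1.0 | 5.0 | 45 |
| 4 | 1005 | 11 | 4.0 | 7 | NaN | 0.0 | 75 |
| 5 | 1006 | 11 | NaN | 7 | 0.0 | 0.0 | 96 |
| 6 | 1007 | 12 | 4.0 | 6 | 0.0 | 0.0 | 80 |
| 7 | 1008 | 12 | 10.0 | 6 | 2.0 | 0.0 | 90 |
| 8 | 1009 | 12 | 2.0 | 8 | 2.0 | 4.0 | 60 |
| 9 | 1010 | 12 | 6.0 | 9 | 1.0 | 0.0 | 85 |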


In [45]:
# Addition
df['Class + Percantege'] = df['Class'] + df['Percantege']

df

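|   | Student ID | Class | Study hrs | Sleeping hrs | Social Media usage hrs | Mobile Games hrs | Percantege | Class + Percantege |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 10 | 2.0 | 9 | 3.0 | 5.0 | 50 | 60 |
| 1 | 1002 | 10 | 6.0 | 8 | 2.0 | 0.0 | 80 | 90 |
| 2 | 1003 | 10 | 3.0 | 8 | 2.0 | NaN | 60 | 70 |
| 3 | 1004 | 11 | 0.0 | 10 | 1.0 | 5.0 | 45 | 56 |
| 4 | 1005 | 11 | 4.0 | 7 | NaN | 0.0 | 75 | 86 |
| 5 | 1006 | 11 | NaN | 7 | 0.0 | 0.0 | 96 | 107 |
| 6 | 1007 | 12 | 4.0 | 6 | 0.0 | 0.0 | 80 | 92 |
| 7 | 1008 | 12 | 10.0 | 6 | 2.0 | 0.0 | 90 | 102 |
| 8 | 1009 | 12 | 2.0 | 8 | 2.0 | 4.0 | 60 | 72 |
| 9 | 1010 | 12 | 6.0 | 9 | 1.0 | 0.0 | 85 | 97 |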


2. **Statistical Operations:**

- Statistical operations involve analyzing data to derive meaningful insights and draw conclusions. They include measures such as mean, median, standard deviation, and correlation.
- These operations are applied to numerical columns and provide information about the central tendency, dispersion, and relationships within the data.
- Example:

In [46]:
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20, 25],
                   'B': [2, 4, 6, 8, 10]})

# Mean
mean_A = df['A'].mean()
mean_B = df['B'].mean()

# Median
median_A = df['A'].median()
median_B = df['B'].median()

# Standard Deviation
std_A = df['A'].std()
std_B = df['B'].std()

# Correlation
correlation = df['A'].corr(df['B'])

print("Mean:", mean_A, mean_B)
print("Median:", median_A, median_B)
print("Standard Deviation:", std_A, std_B)
print("Correlation:", correlation)


Mean: 15.0 6.0
Median: 15.0 6.0
Standard Deviation: 7.905694150420948 3.1622776601683795
Correlation: 1.0


**Explanation**: In this example, we computed the mean, median, standard deviation, and correlation between columns 'A' and 'B' using various statistical methods. These measures provide insights into the central tendency, spread, and relationship between the columns.
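When several of these measures are needed at once, the describe() method returns them in a single table. The sketch below applies it to the same DataFrame and then picks individual values out of the result; the variable name summary is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20, 25],
                   'B': [2, 4, 6, 8, 10]})

# count, mean, std, min, quartiles, and max for every numeric column
summary = df.describe()
print(summary)

# The result is itself a DataFrame, so single values can be read with loc
print(summary.loc['mean', 'A'], summary.loc['max', 'B'])
```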

By applying mathematical and statistical operations to your data, you can uncover patterns, summarize information, and make informed decisions. These operations form the foundation of data analysis and play a crucial role in extracting meaningful insights from your datasets.

In [50]:
import pandas as pd

df = pd.read_csv('data.csv')
df['Mean']=df['Percantege'].mean()
df['Median']=df['Percantege'].median()

df


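|   | Student ID | Class | Study hrs | Sleeping hrs | Social Media usage hrs | Mobile Games hrs | Percantege | Mean | Median |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 10 | 2.0 | 9 | 3.0 | 5.0 | 50 | 72.1 | 77.5 |
| 1 | 1002 | 10 | 6.0 | 8 | 2.0 | 0.0 | 80 | 72.1 | 77.5 |
| 2 | 1003 | 10 | 3.0 | 8 | 2.0 | NaN | 60 | 72.1 | 77.5 |
| 3 | 1004 | 11 | 0.0 | 10 | 1.0 | 5.0 | 45 | 72.1 | 77.5 |
| 4 | 1005 | 11 | 4.0 | 7 | NaN | 0.0 | 75 | 72.1 | 77.5 |
| 5 | 1006 | 11 | NaN | 7 | 0.0 | 0.0 | 96 | 72.1 | 77.5 |
| 6 | 1007 | 12 | 4.0 | 6 | 0.0 | 0.0 | 80 | 72.1 | 77.5 |
| 7 | 1008 | 12 | 10.0 | 6 | 2.0 | 0.0 | 90 | 72.1 | 77.5 |
| 8 | 1009 | 12 | 2.0 | 8 | 2.0 | 4.0 | 60 | 72.1 | 77.5 |
| 9 | 1010 | 12 | 6.0 | 9 | 1.0 | 0.0 | 85 | 72.1 | 77.5 |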


# Working with dates and times in Pandas

Working with dates and times in Pandas is essential when dealing with time series data or analyzing temporal trends. Pandas provides powerful tools to handle dates and times effectively. Let's explore the basics of working with dates and times:

1. **Representing Dates and Times:**

- In Pandas, a single point in time is represented by a **Timestamp** object (the Pandas counterpart of Python's datetime), and a column of dates has a datetime64 data type. Both provide various attributes and methods for manipulating and extracting information.
- You can create these using the pd.to_datetime() function by passing a string or a sequence of strings representing dates.
- Example:

In [51]:
import pandas as pd

# Creating a datetime object
date = pd.to_datetime('2022-05-01')
print(date)


2022-05-01 00:00:00


**Explanation**: In this example, we created a Timestamp (datetime) object representing the date '2022-05-01' using the pd.to_datetime() function. Because no time of day was given, it defaults to midnight (00:00:00).
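pd.to_datetime() also accepts a sequence of strings, an explicit format, and an errors option. The following is a minimal sketch of those three uses; the input strings are invented for illustration.

```python
import pandas as pd

# A list of strings is converted to a DatetimeIndex
dates = pd.to_datetime(['2022-05-01', '2022-06-15'])

# An explicit format removes ambiguity for day-first strings
day_first = pd.to_datetime('01/05/2022', format='%d/%m/%Y')

# errors='coerce' turns unparseable input into NaT (Not a Time)
# instead of raising an error
bad = pd.to_datetime('not a date', format='%Y-%m-%d', errors='coerce')

print(dates)
print(day_first)
print(bad)
```

NaT plays the same role for dates that NaN plays for numbers, so isnull(), dropna(), and fillna() treat it as missing data.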

2. **Working with Date and Time Columns:**

- Once a column has been converted with pd.to_datetime(), each of its values is a Timestamp, and the column exposes date and time properties through the **.dt** accessor.
- You can use the .dt accessor to extract specific information like year, month, day, hour, minute, and more from every value in the column at once.
- Example:

In [55]:
import pandas as pd

# Creating a DataFrame with dates
df = pd.DataFrame({'Date': ['2022-05-01', '2022-06-01', '2022-07-01'],
                   'Sales': [100, 150, 200]})

print("Dataset \n", df)

print("\n\n")

# Converting the 'Date' column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Extracting year and month
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

print(df)


Dataset 
          Date  Sales
0  2022-05-01    100
1  2022-06-01    150
2  2022-07-01    200



        Date  Sales  Year  Month
0 2022-05-01    100  2022      5
1 2022-06-01    150  2022      6
2 2022-07-01    200  2022      7


**Explanation**: In this example, we created a DataFrame df with a 'Date' column. We converted the 'Date' column to datetime using **pd.to_datetime()** and extracted the year and month using the **.dt.year** and **.dt.month** attributes, respectively.

Working with dates and times in Pandas allows you to perform various operations such as filtering, grouping, and aggregating data based on time-based criteria. It enables you to analyze temporal patterns, trends, and relationships in your data effectively.
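As a brief sketch of the time-based filtering and aggregating mentioned above, the code below reuses the sales DataFrame. The cut-off date and the new 'Weekday' column are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2022-05-01', '2022-06-01', '2022-07-01'],
                   'Sales': [100, 150, 200]})
df['Date'] = pd.to_datetime(df['Date'])

# Filtering: keep rows on or after 1 June 2022
# (the string is interpreted as a date for the comparison)
recent = df[df['Date'] >= '2022-06-01']

# Extracting: name of the weekday for each date
df['Weekday'] = df['Date'].dt.day_name()

# Aggregating: total sales per calendar month
monthly_sales = df.groupby(df['Date'].dt.month)['Sales'].sum()

print(recent)
print(df)
print(monthly_sales)
```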

# Here are 10 assignment questions related to working with Pandas and data analysis:

1. Load a dataset of your choice into a Pandas DataFrame and perform basic exploratory data analysis, including checking the data types, summary statistics, and identifying missing values.

2. Filter a DataFrame to select rows where a specific column meets a certain condition, such as selecting all students with a score above 90.

3. Calculate the mean, median, and standard deviation of a numerical column in a DataFrame and interpret the results in the context of the dataset.

4. Merge two DataFrames based on a common column and analyze the merged dataset to gain insights or answer specific questions.

5. Reshape a DataFrame using pivoting or melting techniques to transform the data structure and make it suitable for a particular analysis or visualization task.

6. Group a DataFrame by a categorical column and calculate aggregate statistics such as the sum, count, or average for each group.

7. Perform data cleaning operations on a DataFrame, including handling missing values, removing duplicates, and dealing with inconsistent data formats.

8. Visualize data using Pandas and Matplotlib by creating various plots, such as line plots, bar plots, or scatter plots, to analyze trends or relationships.

9. Apply feature engineering techniques to a DataFrame by creating new columns based on existing ones, such as calculating age from a birthdate column or extracting month from a date column.

10. Conduct a time series analysis by manipulating and analyzing a DataFrame with dates and times, including plotting time series data, calculating rolling averages, or identifying seasonal patterns.

These assignment questions cover a range of topics and tasks to help you practice and reinforce your understanding of Pandas and data analysis techniques.

# Conclusion:

In this lesson, we covered several important concepts related to data manipulation using Pandas. We started by learning how to access and manipulate DataFrame columns and rows, including techniques such as indexing, slicing, and conditional selection. We then explored various methods for filtering and selecting data based on specific criteria, allowing us to extract subsets of data for further analysis.

We also discussed how to handle missing data, which is a common challenge in data analysis. Pandas provides functionalities to identify and handle missing values, such as dropping rows or filling missing values with appropriate strategies.

Furthermore, we introduced the concepts of data sorting and ranking. Sorting data enables us to arrange the DataFrame in a specific order based on column values, allowing us to identify patterns or outliers. Ranking, on the other hand, assigns a rank to each value based on their order, providing a comparative measure within the dataset.

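We also applied mathematical and statistical operations to columns, such as addition, multiplication, mean, median, standard deviation, and correlation. Additionally, we explored the basics of working with dates and times in Pandas. We learned how to represent dates using the datetime data type and how to work with date and time columns in a DataFrame. This is particularly useful when dealing with time series data or analyzing temporal trends.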

By understanding these concepts and techniques, you have gained a solid foundation in data manipulation using Pandas. You are now equipped with the skills to access, filter, and manipulate data to extract meaningful insights and perform various data analysis tasks.

In the next lesson, we will delve deeper into data visualization using Pandas and explore different plotting techniques to effectively communicate and visualize data.