<b>Introduction to Pandas:</b>

Pandas is a powerful data manipulation and analysis library for Python. The two primary data structures in Pandas are:

<b>Series:</b> A one-dimensional labeled array capable of holding any data type. It is similar to a column in a table.

<b>DataFrame:</b> A two-dimensional labeled data structure with columns of potentially different types. It can be thought of as a table or a spreadsheet.

<b>Creating DataFrames and Series:</b>

You can create DataFrames and Series from various data sources such as lists, dictionaries, and CSV files.

In [2]:
import pandas as pd

# Creating a Series from a list
list_data = [10, 20, 30, 40, 50]
series = pd.Series(list_data)
print("Series from list:")
print(series)

# Creating a DataFrame from a dictionary
dict_data = {'Name': ['Alice', 'Bob', 'Charlie'],
             'Age': [25, 30, 35],
             'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(dict_data)
print("\nDataFrame from dictionary:")
print(df)

# Reading a CSV file into a DataFrame
# Uncomment the following lines if you have a CSV file to read
# df_csv = pd.read_csv('path_to_file.csv')
# print("\nDataFrame from CSV file:")
# print(df_csv)


Series from list:
0    10
1    20
2    30
3    40
4    50
dtype: int64

DataFrame from dictionary:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
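To illustrate the "labeled" part of both definitions, here is a minimal sketch that gives a Series explicit index labels and shows that selecting one DataFrame column yields a Series. The names `scores` and `people` and their values are made up for this example.

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
scores = pd.Series([88, 92, 79], index=['math', 'physics', 'art'])
print(scores['physics'])          # look up a value by its label

# Each column of a DataFrame is itself a Series sharing the row index
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
age_column = people['Age']
print(type(age_column).__name__)  # class name of the selected column
print(people.dtypes)              # one dtype per column
```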


<b>Common Operations:</b>

<b>Selecting Data:</b>

Selecting data refers to accessing specific rows or columns from a DataFrame. This can be done using labels (column names) or positions (row and column indices).

<b>Selecting Columns:</b> You can select one or more columns by passing the column name(s) in square brackets.

<b>Selecting Rows:</b> You can select rows using the loc and iloc accessors.

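loc: Access a group of rows and columns by labels or a boolean array. Label slices include both endpoints, so df.loc[1:2] returns the rows labeled 1 and 2.

iloc: Access a group of rows and columns by integer position(s). Position slices exclude the end, so df.iloc[:2] returns the rows at positions 0 and 1.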

In [3]:
# Selecting a single column
print("\nSelecting the 'Name' column:")
print(df['Name'])

# Selecting multiple columns
print("\nSelecting the 'Name' and 'City' columns:")
print(df[['Name', 'City']])

# Selecting rows using loc
print("\nSelecting rows where index is 1 and 2 using loc:")
print(df.loc[1:2])

# Selecting rows using iloc
print("\nSelecting the first two rows using iloc:")
print(df.iloc[:2])



Selecting the 'Name' column:
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

Selecting the 'Name' and 'City' columns:
      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago

Selecting rows where index is 1 and 2 using loc:
      Name  Age         City
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Selecting the first two rows using iloc:
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
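The difference between loc and iloc is easier to see when the row labels are not the default integers. The sketch below assumes a hypothetical `cities` DataFrame with string labels and rough illustrative values; it selects rows and columns together and uses a boolean array with loc.

```python
import pandas as pd

# Hypothetical data with string row labels, so labels and positions differ
cities = pd.DataFrame(
    {'Population_m': [8.3, 3.8, 2.7], 'State': ['NY', 'CA', 'IL']},
    index=['New York', 'Los Angeles', 'Chicago'],
)

# loc: label-based, and the end of a label slice is INCLUDED
by_label = cities.loc['New York':'Los Angeles', ['State']]

# iloc: position-based, and the end of the slice is EXCLUDED
by_position = cities.iloc[0:2, 0]

# loc also accepts a boolean array to pick rows
large = cities.loc[cities['Population_m'] > 3.0, 'State']

print(by_label)
print(by_position)
print(large)
```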


<b>Filtering Rows:</b>

Filtering rows involves extracting rows that meet certain conditions. This is typically done using boolean indexing, where a boolean condition is applied to the DataFrame.

In [4]:
# Filtering rows where Age > 25
print("\nFiltering rows where Age > 25:")
print(df[df['Age'] > 25])

# Filtering rows based on multiple conditions
print("\nFiltering rows where Age > 25 and City is 'Chicago':")
print(df[(df['Age'] > 25) & (df['City'] == 'Chicago')])



Filtering rows where Age > 25:
      Name  Age         City
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Filtering rows where Age > 25 and City is 'Chicago':
      Name  Age     City
2  Charlie   35  Chicago
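Boolean indexing also supports OR (`|`), NOT (`~`), and membership tests. The following sketch rebuilds the same three-row table under the hypothetical name `people` so that it runs on its own; `query()` is shown as one alternative way to write a filter.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 35],
                       'City': ['New York', 'Los Angeles', 'Chicago']})

# OR: wrap each condition in parentheses and combine with |
young_or_chicago = people[(people['Age'] < 30) | (people['City'] == 'Chicago')]

# NOT: invert a boolean mask with ~
not_chicago = people[~(people['City'] == 'Chicago')]

# Membership test against several values
coastal = people[people['City'].isin(['New York', 'Los Angeles'])]

# The same kind of filter written as a query string
over_25 = people.query('Age > 25')

print(young_or_chicago)
print(over_25)
```

The parentheses around each condition matter because `&`, `|`, and `~` bind more tightly than comparison operators in Python.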


<b>Modifying Data:</b>

Modifying data involves changing the values within the DataFrame. This can include updating individual values, adding new columns, or applying functions to columns.

<b>Updating Values:</b> You can update individual values or entire columns using the loc or iloc accessors.

<b>Adding Columns:</b> New columns can be added by assigning a scalar, a list, or a Series to a new column name. A scalar is repeated for every row.

<b>Applying Functions:</b> Use apply() on a Series to apply a function to each element of a column, or on a DataFrame to apply a function to each column (axis=0) or each row (axis=1).

In [5]:
# Updating a single value
df.loc[1, 'City'] = 'San Francisco'
print("\nDataFrame after modifying a single value:")
print(df)

# Adding a new column
df['Country'] = 'USA'
print("\nDataFrame after adding a new column:")
print(df)

# Applying a function to a column
df['Age_plus_10'] = df['Age'].apply(lambda x: x + 10)
print("\nDataFrame after applying a function to the 'Age' column:")
print(df)



DataFrame after modifying a single value:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35        Chicago

DataFrame after adding a new column:
      Name  Age           City Country
0    Alice   25       New York     USA
1      Bob   30  San Francisco     USA
2  Charlie   35        Chicago     USA

DataFrame after applying a function to the 'Age' column:
      Name  Age           City Country  Age_plus_10
0    Alice   25       New York     USA           35
1      Bob   30  San Francisco     USA           40
2  Charlie   35        Chicago     USA           45
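Two variations on the modifications above may be useful: plain vectorized arithmetic in place of an element-wise apply, and a row-wise apply with axis=1. This is a sketch on a small hypothetical `people` table; the `Label` and `Group` columns are invented for illustration.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Vectorized arithmetic: same result as .apply(lambda x: x + 10)
people['Age_plus_10'] = people['Age'] + 10

# Row-wise apply: axis=1 passes each row to the function as a Series
people['Label'] = people.apply(
    lambda row: f"{row['Name']} ({row['Age']})", axis=1
)

# Updating several rows at once with a boolean mask and loc
people['Group'] = 'junior'
people.loc[people['Age'] >= 30, 'Group'] = 'senior'

print(people)
```

For simple arithmetic the vectorized form is usually preferable, since apply calls a Python function once per element or row.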


<b>Data Handling with Pandas:</b>

<b>Reading Data from Files:</b>

Pandas can read data from various sources, including CSV files, Excel files, and SQL databases. The most common format is CSV (Comma Separated Values), which stores tabular data in plain text.



In [7]:
# Reading a CSV file into a DataFrame
# Uncomment the following lines if you have a CSV file to read
# df_csv = pd.read_csv('path_to_file.csv')
# print("DataFrame from CSV file:")
# print(df_csv)
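Because the cell above needs a file on disk, here is a self-contained sketch that reads CSV text from memory instead. The CSV contents are made up; `io.StringIO` simply stands in for a real file.

```python
import io
import pandas as pd

# Inline CSV text standing in for a real file (contents are made up)
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
print(df_csv.dtypes)

# Writing back out: index=False omits the row index column
round_trip = df_csv.to_csv(index=False)
print(round_trip)
```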


<b>Handling Missing Data:</b>

Handling missing data is a crucial step in data preprocessing. Missing data can cause issues in analysis and modeling. Pandas provides functions to handle missing values, remove duplicates, and convert data types.

<b>Causes of Missing Data:</b>
Missing data can occur due to various reasons, including:

Errors in data collection or entry.

Incomplete data sources.

Data corruption.

<b>Identifying Missing Data:</b>

You can identify missing data in a DataFrame using isnull() and notnull() (also available as isna() and notna()). Each returns a DataFrame of the same shape with Boolean values: isnull() marks missing entries as True, and notnull() marks present entries as True.

In [8]:
# Creating a DataFrame with missing values
df_missing = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6], 'C': [7, 8, 9]})
print("\nDataFrame with missing values:")
print(df_missing)

# Checking for missing values
print("\nChecking for missing values:")
print(df_missing.isnull())



DataFrame with missing values:
     A    B  C
0  1.0  4.0  7
1  2.0  NaN  8
2  NaN  6.0  9

Checking for missing values:
       A      B      C
0  False  False  False
1  False   True  False
2   True  False  False
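On larger tables the full Boolean mask is hard to read, so it is common to summarize it. A minimal sketch, rebuilding the same `df_missing` table so the block runs on its own:

```python
import pandas as pd

df_missing = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6], 'C': [7, 8, 9]})

# Number of missing values per column (True counts as 1 when summed)
missing_per_column = df_missing.isnull().sum()

# Rows that contain at least one missing value
rows_with_missing = df_missing[df_missing.isnull().any(axis=1)]

# notnull() is the inverse mask: True where a value is present
present_mask = df_missing.notnull()

print(missing_per_column)
print(rows_with_missing)
```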


<b>Handling Missing Values:</b>

There are several strategies for handling missing values:

<b>Removing Missing Data:</b> Use dropna() to remove rows or columns with missing values.

<b>Filling Missing Values:</b> Use fillna() to replace missing values with a specified value, or ffill() and bfill() to propagate the previous or next valid value.

<b>Imputation:</b> Use statistical methods to estimate and fill missing values.

In [11]:
# Removing rows with missing values
df_missing_dropped = df_missing.dropna()
print("\nDataFrame after dropping missing values:")
print(df_missing_dropped)

# Filling missing values with a specified value
df_missing_filled = df_missing.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_missing_filled)

# Forward fill (propagate last valid observation forward)
# ffill() replaces the deprecated fillna(method='ffill')
df_missing_ffill = df_missing.ffill()
print("\nDataFrame after forward fill:")
print(df_missing_ffill)

# Backward fill (propagate next valid observation backward)
# bfill() replaces the deprecated fillna(method='bfill')
df_missing_bfill = df_missing.bfill()
print("\nDataFrame after backward fill:")
print(df_missing_bfill)



DataFrame after dropping missing values:
     A    B  C
0  1.0  4.0  7

DataFrame after filling missing values with 0:
     A    B  C
0  1.0  4.0  7
1  2.0  0.0  8
2  0.0  6.0  9

DataFrame after forward fill:
     A    B  C
0  1.0  4.0  7
1  2.0  4.0  8
2  2.0  6.0  9

DataFrame after backward fill:
     A    B  C
0  1.0  4.0  7
1  2.0  6.0  8
2  NaN  6.0  9
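The imputation strategy listed above is not shown in the cell, so here is one simple sketch of it: replacing missing values with each column's mean or median. This is only one of many possible imputation methods, and the variable names are chosen for this example.

```python
import pandas as pd

df_missing = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6], 'C': [7, 8, 9]})

# Mean imputation: passing a Series of column means to fillna fills each
# column with its own mean (means are computed ignoring NaN)
df_mean_imputed = df_missing.fillna(df_missing.mean())

# Median imputation for a single column
b_median_imputed = df_missing['B'].fillna(df_missing['B'].median())

print(df_mean_imputed)
print(b_median_imputed)
```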




<b>Removing Duplicates:</b>

Duplicates can distort analysis and insights. Removing duplicates ensures data integrity.

In [12]:
# Creating a DataFrame with duplicates
df_duplicates = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 3, 4]})
print("\nDataFrame with duplicates:")
print(df_duplicates)

# Removing duplicates
df_no_duplicates = df_duplicates.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)



DataFrame with duplicates:
   A  B
0  1  3
1  1  3
2  2  4

DataFrame after removing duplicates:
   A  B
0  1  3
2  2  4
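By default drop_duplicates() only removes rows that match in every column. When "duplicate" means "same key", the subset and keep parameters can be used, as in this sketch on a hypothetical `records` table:

```python
import pandas as pd

# Hypothetical records: two rows share a Name but differ in City
records = pd.DataFrame({'Name': ['Bob', 'Bob', 'Alice'],
                        'City': ['Los Angeles', 'Unknown', 'New York']})

# duplicated() flags repeats; with subset it compares only those columns
is_repeat = records.duplicated(subset=['Name'])

# Keep the first row per Name (the default, keep='first')
first_per_name = records.drop_duplicates(subset=['Name'])

# Keep the last row per Name instead
last_per_name = records.drop_duplicates(subset=['Name'], keep='last')

print(is_repeat)
print(first_per_name)
print(last_per_name)
```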


<b>Data Type Conversion:</b>

Converting data types ensures consistency and allows for correct mathematical operations.

In [13]:
# Converting data types
df['Age'] = df['Age'].astype(float)
print("\nDataFrame after data type conversion:")
print(df)



DataFrame after data type conversion:
      Name   Age           City Country  Age_plus_10
0    Alice  25.0       New York     USA           35
1      Bob  30.0  San Francisco     USA           40
2  Charlie  35.0        Chicago     USA           45
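Type conversion most often comes up when numbers or dates arrive as text. The sketch below uses made-up string data and shows three common conversions: pd.to_numeric with errors='coerce', astype on clean strings, and pd.to_datetime.

```python
import pandas as pd

# Hypothetical column read in as text, with one unparseable entry
raw = pd.Series(['25', '30', 'unknown'])

# errors='coerce' turns values that cannot be parsed into NaN
ages = pd.to_numeric(raw, errors='coerce')

# astype works when every value converts cleanly
clean = pd.Series(['1', '2', '3']).astype(int)

# Strings to datetimes
dates = pd.to_datetime(pd.Series(['2024-01-15', '2024-02-01']))

print(ages)
print(clean.dtype)
print(dates)
```

Calling astype(int) on `raw` directly would raise an error because of the 'unknown' entry, which is why the coercing form is shown first.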


<b>Data Analysis with Pandas:</b>

<b>Summary Statistics:</b>

Pandas can generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution. This includes functions like mean(), median(), mode(), std(), and var().

<b>Mean:</b> Average value of the data.

<b>Median:</b> Middle value of the data.

<b>Mode:</b> Most frequently occurring value in the data.

<b>Standard Deviation (std):</b> Measure of the amount of variation or dispersion.

<b>Variance (var):</b> Average squared deviation from the mean; the square of the standard deviation.

In [14]:
# Summary statistics for the 'Age' column
mean_age = df['Age'].mean()
median_age = df['Age'].median()
std_age = df['Age'].std()
variance_age = df['Age'].var()
mode_age = df['Age'].mode()

print("\nSummary statistics for 'Age':")
print(f"Mean Age: {mean_age}")
print(f"Median Age: {median_age}")
print(f"Standard Deviation of Age: {std_age}")
print(f"Variance of Age: {variance_age}")
print(f"Mode of Age: {mode_age.values}")



Summary statistics for 'Age':
Mean Age: 30.0
Median Age: 30.0
Standard Deviation of Age: 5.0
Variance of Age: 25.0
Mode of Age: [25. 30. 35.]
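The individual statistics can also be obtained in one call with describe(). The sketch below uses a standalone `ages` Series with the same three values, and also illustrates that pandas computes the sample variance (ddof=1) by default, which is why the variance above equals the standard deviation squared.

```python
import pandas as pd

ages = pd.Series([25.0, 30.0, 35.0], name='Age')

# describe() bundles count, mean, std, min, quartiles, and max
summary = ages.describe()
print(summary)

# pandas uses the sample formulas (ddof=1) by default, so var == std ** 2
sample_var = ages.var()
population_var = ages.var(ddof=0)

# mode() returns a Series because several values can tie for most frequent
modes = ages.mode()
print(sample_var, population_var)
print(modes)
```

All three ages are listed as the mode because each value occurs exactly once, so they tie.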


<b>Grouping Data and Aggregate Functions:</b>

Grouping data and applying aggregate functions is a powerful feature in Pandas. It allows you to split your data into groups based on some criteria and then apply a function to each group. Common aggregate functions include mean(), sum(), count(), and max().

<b>Grouping Data:</b>
Grouping data involves splitting the data into subsets, applying a function to each subset, and combining the results into a Series or DataFrame.

<b>Grouping by a Single Column:</b> Use the groupby() function to group data by a single column.

<b>Grouping by Multiple Columns:</b> Group data by multiple columns to get more detailed groupings.

In [15]:
# Grouping data by 'City' and calculating mean age
grouped_df = df.groupby('City')['Age'].mean()
print("\nMean age by city:")
print(grouped_df)

# Grouping by multiple columns and calculating sum
# (string columns such as 'Name' are concatenated within each group)
df_grouped = df.groupby(['City', 'Age']).sum()
print("\nGrouped DataFrame by 'City' and 'Age':")
print(df_grouped)

# Count the number of entries in each group
df_grouped_count = df.groupby('City').count()
print("\nCount of entries by city:")
print(df_grouped_count)



Mean age by city:
City
Chicago          35.0
New York         25.0
San Francisco    30.0
Name: Age, dtype: float64

Grouped DataFrame by 'City' and 'Age':
                       Name Country  Age_plus_10
City          Age                               
Chicago       35.0  Charlie     USA           45
New York      25.0    Alice     USA           35
San Francisco 30.0      Bob     USA           40

Count of entries by city:
               Name  Age  Country  Age_plus_10
City                                          
Chicago           1    1        1            1
New York          1    1        1            1
San Francisco     1    1        1            1
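Since every city in the table above appears only once, each group holds a single row and the aggregation has little to do. The sketch below uses a hypothetical `sales` table with repeated keys to show the effect of grouping by one and by two columns.

```python
import pandas as pd

# Hypothetical sales records with repeated group keys
sales = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'New York', 'New York', 'New York'],
    'Product': ['A', 'B', 'A', 'A', 'B'],
    'Amount': [10, 20, 5, 15, 30],
})

# Single key: one total per city
total_by_city = sales.groupby('City')['Amount'].sum()

# Two keys: one total per (City, Product) pair, indexed by a MultiIndex
total_by_city_product = sales.groupby(['City', 'Product'])['Amount'].sum()

# as_index=False returns the keys as ordinary columns instead
flat = sales.groupby(['City', 'Product'], as_index=False)['Amount'].sum()

print(total_by_city)
print(total_by_city_product)
print(flat)
```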


<b>Applying Aggregate Functions:</b>

Aggregate functions can be applied to grouped data to summarize and analyze it. These functions include:

<b>Mean (mean()):</b> Calculate the average value of each group.

<b>Sum (sum()):</b> Calculate the sum of each group.

<b>Count (count()):</b> Count the number of observations in each group.

<b>Maximum (max()):</b> Find the maximum value in each group.

<b>Minimum (min()):</b> Find the minimum value in each group.

<b>Standard Deviation (std()):</b> Calculate the standard deviation of each group.

<b>Variance (var()):</b> Calculate the variance of each group.

In [16]:
# Applying multiple aggregate functions
df_grouped_agg = df.groupby('City').agg({
    'Age': ['mean', 'max', 'min', 'std'],
    'Name': 'count'
})
print("\nGrouped DataFrame with multiple aggregate functions:")
print(df_grouped_agg)

# Renaming columns after aggregation
df_grouped_agg.columns = ['Mean Age', 'Max Age', 'Min Age', 'Age Std Dev', 'Name Count']
print("\nGrouped DataFrame with renamed columns:")
print(df_grouped_agg)



Grouped DataFrame with multiple aggregate functions:
                Age                  Name
               mean   max   min std count
City                                     
Chicago        35.0  35.0  35.0 NaN     1
New York       25.0  25.0  25.0 NaN     1
San Francisco  30.0  30.0  30.0 NaN     1

Grouped DataFrame with renamed columns:
               Mean Age  Max Age  Min Age  Age Std Dev  Name Count
City                                                              
Chicago            35.0     35.0     35.0          NaN           1
New York           25.0     25.0     25.0          NaN           1
San Francisco      30.0     30.0     30.0          NaN           1
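The std column above is NaN because each group contains a single row, and the sample standard deviation is undefined for one observation. As an alternative to renaming columns by hand, pandas also supports named aggregation; the sketch below assumes a small hypothetical `people` table in which one city has two rows.

```python
import pandas as pd

people = pd.DataFrame({'City': ['Chicago', 'Chicago', 'New York'],
                       'Name': ['Charlie', 'Dana', 'Alice'],
                       'Age': [35.0, 45.0, 25.0]})

# Named aggregation: output_column=(input_column, function)
city_stats = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    max_age=('Age', 'max'),
    age_std=('Age', 'std'),
    name_count=('Name', 'count'),
)
print(city_stats)
```

The result has flat, already-named columns, so it does not depend on the order in which the aggregations were listed.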


<b>Advanced Data Manipulation:</b>

Advanced data manipulation techniques in Pandas include merging, joining, and concatenating DataFrames. These operations allow you to combine multiple DataFrames into a single DataFrame.

<b>Merging:</b> Combine two DataFrames based on a key column using pd.merge().

<b>Joining:</b> Join DataFrames using their indexes with join().

<b>Concatenating:</b> Append DataFrames vertically or horizontally using pd.concat().

<b>Merging DataFrames:</b>
Merging is similar to SQL joins and is used to combine DataFrames on a key column.

In [17]:
# Merging DataFrames
df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='Key', how='inner')
print("\nMerged DataFrame (inner join):")
print(merged_df)

# Outer merge
merged_df_outer = pd.merge(df1, df2, on='Key', how='outer')
print("\nMerged DataFrame (outer join):")
print(merged_df_outer)



Merged DataFrame (inner join):
  Key  Value1  Value2
0   A       1       4
1   B       2       5

Merged DataFrame (outer join):
  Key  Value1  Value2
0   A     1.0     4.0
1   B     2.0     5.0
2   C     3.0     NaN
3   D     NaN     6.0
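Joining with join() is described above but not demonstrated, so here is a brief sketch using the same two tables. It also shows a left merge with indicator=True, which records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})

# Left merge: keep every row of df1; indicator=True adds a '_merge' column
left_merged = pd.merge(df1, df2, on='Key', how='left', indicator=True)

# join() aligns on the index, so move 'Key' into the index first
joined = df1.set_index('Key').join(df2.set_index('Key'), how='inner')

print(left_merged)
print(joined)
```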


<b>Concatenating DataFrames:</b>

Concatenation is used to append DataFrames either vertically (axis=0) or horizontally (axis=1).

In [18]:
# Concatenating DataFrames vertically
concat_df_vertical = pd.concat([df1, df2], axis=0)
print("\nConcatenated DataFrame (vertical):")
print(concat_df_vertical)

# Concatenating DataFrames horizontally
concat_df_horizontal = pd.concat([df1, df2], axis=1)
print("\nConcatenated DataFrame (horizontal):")
print(concat_df_horizontal)



Concatenated DataFrame (vertical):
  Key  Value1  Value2
0   A     1.0     NaN
1   B     2.0     NaN
2   C     3.0     NaN
0   A     NaN     4.0
1   B     NaN     5.0
2   D     NaN     6.0

Concatenated DataFrame (horizontal):
  Key  Value1 Key  Value2
0   A       1   A       4
1   B       2   B       5
2   C       3   D       6
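Note that the vertical result above repeats the row labels 0, 1, 2. If unique labels are needed, one option is ignore_index=True; another is keys=, which tags each source block. A minimal sketch with the same two tables:

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})

# ignore_index=True discards the original labels and renumbers rows 0..n-1
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# keys= labels each block instead, producing a two-level row index
labeled = pd.concat([df1, df2], keys=['first', 'second'])

print(stacked)
print(labeled)
```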


<b>Application in Data Science:</b>

<b>Advantages of Using Pandas:</b>
Pandas is a powerful and flexible data manipulation library that offers several advantages over traditional Python data structures:

<b>Ease of Use:</b> Intuitive API and powerful functions for data manipulation and analysis.

<b>Performance:</b> Built on top of NumPy, so column-wise (vectorized) operations are typically much faster than equivalent Python loops.

<b>Data Handling:</b> Ability to handle missing data, duplicates, and data type conversions.

<b>Integration:</b> Easily integrates with other data science libraries such as NumPy, Matplotlib, and Scikit-learn.

<b>Real-World Applications:</b>

Pandas is essential in various real-world data science applications, including:

<b>Data Cleaning:</b> Preprocessing raw data to remove inconsistencies and prepare it for analysis.

<b>Exploratory Data Analysis (EDA):</b> Summarizing and visualizing data to uncover patterns and insights.

<b>Data Transformation:</b> Reshaping and transforming data for analysis and modeling.

<b>Time Series Analysis:</b> Handling and analyzing time series data for trends and patterns.
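As a brief illustration of the time series item in the list above, the sketch below builds a hypothetical daily series and applies two common operations, resampling and a rolling mean. The dates and values are invented for the example.

```python
import pandas as pd

# Hypothetical daily readings indexed by date
dates = pd.date_range('2024-01-01', periods=6, freq='D')
readings = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Downsample to 2-day bins and sum each bin
two_day_totals = readings.resample('2D').sum()

# Rolling mean over a 3-observation window (first two entries are NaN)
rolling_mean = readings.rolling(window=3).mean()

print(two_day_totals)
print(rolling_mean)
```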

<b>Example: Data Cleaning:</b>

Pandas is widely used for data cleaning, which involves handling missing values, removing duplicates, and correcting data types.

In [19]:
# Example DataFrame with messy data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Bob'],
        'Age': [25, None, 35, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', None]}
df_messy = pd.DataFrame(data)

# Handling missing values
df_messy['Age'] = df_messy['Age'].fillna(df_messy['Age'].mean())
df_messy['City'] = df_messy['City'].fillna('Unknown')

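# Removing duplicates (only rows identical in every column are dropped;
# the two 'Bob' rows differ in 'City', so both are kept here)
df_clean = df_messy.drop_duplicates()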

print("\nCleaned DataFrame:")
print(df_clean)



Cleaned DataFrame:
      Name   Age         City
0    Alice  25.0     New York
1      Bob  30.0  Los Angeles
2  Charlie  35.0      Chicago
3      Bob  30.0      Unknown


<b>Example: Exploratory Data Analysis (EDA):</b>

EDA involves generating summary statistics and visualizing data to uncover patterns and insights.

In [20]:
# Generating summary statistics
print("\nSummary statistics of the cleaned DataFrame:")
print(df_clean.describe())

# Visualizing data (requires Matplotlib or Seaborn)
# import matplotlib.pyplot as plt
# df_clean['Age'].plot(kind='hist', title='Age Distribution')
# plt.show()



Summary statistics of the cleaned DataFrame:
             Age
count   4.000000
mean   30.000000
std     4.082483
min    25.000000
25%    28.750000
50%    30.000000
75%    31.250000
max    35.000000
