# PANDAS

GETTING FAMILIAR WITH PANDAS :
  -

1. UNDERSTANDING DATAFRAMES AND SERIES :

   By exploring Pandas' basic functionalities—starting with understanding DataFrames and Series, creating them from various data sources, and practicing common operations—you'll gain a solid foundation in using this powerful library. This knowledge will serve as the building blocks for more advanced data manipulation and analysis tasks in your data science journey.
  - Install pandas :

In [7]:
pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


- Import pandas :

In [2]:
import pandas as pd

 - Series:

A Pandas Series is a one-dimensional array-like object that can hold data of any type (integers, floats, strings, etc.). Each element in a Series has a label, also known as an index.
Think of a Series as a column in a spreadsheet.

In [3]:
# Creating a Series
data = [10, 20, 30, 40]
s = pd.Series(data)
print(s)

0    10
1    20
2    30
3    40
dtype: int64
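
Because every element carries a label, a Series does not have to use the default 0, 1, 2, ... index. The following is a minimal sketch of that idea; the labels 'a' to 'd' and the name `labeled` are made up for illustration.

```python
import pandas as pd

# Hypothetical labels chosen for illustration
labeled = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(labeled['b'])            # access by label    -> 20
print(labeled.iloc[1])         # access by position -> 20
print(labeled.index.tolist())  # ['a', 'b', 'c', 'd']
```

Label-based access (`labeled['b']`) and position-based access (`labeled.iloc[1]`) reach the same element here.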


- DataFrame:
  
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It’s similar to a table in a relational database or an Excel spreadsheet.

In [13]:
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
dfe = pd.DataFrame(data)
print(dfe)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
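
To make "labeled axes" and "heterogeneous" more concrete, the sketch below rebuilds the same table under the assumed name `people` and inspects its shape, labels, and per-column data types.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

print(people.shape)             # (rows, columns) -> (3, 3)
print(people.columns.tolist())  # column labels
print(people.index.tolist())    # row labels -> [0, 1, 2]
print(people.dtypes)            # one dtype per column: text and integer columns coexist
```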


2. CREATING DATAFRAMES AND SERIES FROM VARIOUS DATA SOURCES :
   
- From Lists :

In [14]:
# Series from a list
data = [5, 10, 15, 20]
s = pd.Series(data)
print(s)
# DataFrame from lists
data = {'Column1': [1, 2, 3], 'Column2': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)

0     5
1    10
2    15
3    20
dtype: int64
   Column1  Column2
0        1        4
1        2        5
2        3        6
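
The DataFrame above is built from a dictionary whose values are lists, so each list becomes a column. When the data arrives as a list of rows instead, one possible approach is to pass the column names separately. The names `rows` and `df_rows` are assumptions for this sketch.

```python
import pandas as pd

# Each inner list is one row
rows = [
    [1, 4],
    [2, 5],
    [3, 6],
]
df_rows = pd.DataFrame(rows, columns=['Column1', 'Column2'])
print(df_rows)
```

This should produce the same table as the dictionary-of-lists version.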


- From Dictionaries :

In [15]:
# Series from a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(f"series :\n{s}")
# DataFrame from a dictionary
data = {
    'Product': ['A', 'B', 'C'],
    'Price': [10.99, 15.49, 7.99]
}
df = pd.DataFrame(data)
print(f"dataframe:\n{df}")

series :
a    1
b    2
c    3
dtype: int64
dataframe:
  Product  Price
0       A  10.99
1       B  15.49
2       C   7.99
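
Dictionaries can also describe one row each. The sketch below, using made-up records, builds a DataFrame from a list of dictionaries; a key missing from a record shows up as a missing value (NaN) in that row.

```python
import pandas as pd

# Hypothetical records; the second one has no 'Price' key
records = [
    {'Product': 'A', 'Price': 10.99},
    {'Product': 'B'},
    {'Product': 'C', 'Price': 7.99},
]
products = pd.DataFrame(records)
print(products)
```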


- From CSV Files :

In [16]:
# Reading a CSV file into a DataFrame
df = pd.read_csv('students.csv')
print(df)

  Name Branch Section
0   S1    CSD       A
1   S2    CSE       C
2   S3   MECH       B
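
The contents of 'students.csv' are not shown, so the sketch below assumes CSV text that would produce the output above and reads it from memory with `io.StringIO`. This keeps the example runnable without a file on disk. The `usecols` call is one way to load only some of the columns.

```python
import io
import pandas as pd

# Assumed CSV text standing in for students.csv (inferred from the printed output)
csv_text = """Name,Branch,Section
S1,CSD,A
S2,CSE,C
S3,MECH,B
"""

students = pd.read_csv(io.StringIO(csv_text))
print(students)

# usecols limits which columns are loaded
subset = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Branch'])
print(subset.columns.tolist())
```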


3. COMMON OPERATIONS :
   - Selecting Data :
      1. Select a single column :

In [17]:
print(df['Name'])  # Returns a Series

0    S1
1    S2
2    S3
Name: Name, dtype: object


      2. Select multiple columns :

In [18]:
print(df[['Name', 'Branch']])  # Returns a DataFrame

  Name Branch
0   S1    CSD
1   S2    CSE
2   S3   MECH
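
Column selection with square brackets has a row-oriented counterpart. As a hedged sketch, `.loc` selects by label and `.iloc` by position; the DataFrame `students` below is an assumed copy of the table above.

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['S1', 'S2', 'S3'],
    'Branch': ['CSD', 'CSE', 'MECH'],
    'Section': ['A', 'C', 'B'],
})

first_row = students.iloc[0]                    # first row by position (returns a Series)
cell = students.loc[1, 'Branch']                # row label 1, column 'Branch'
block = students.loc[0:1, ['Name', 'Section']]  # label slices include the end label

print(first_row)
print(cell)
print(block)
```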


  - FILTERING ROWS :
      Filter rows based on a condition :

In [19]:
# Filter rows where Age is greater than 25
filtered_df = dfe[dfe['Age'] > 25]
print(filtered_df)

      Name  Age         City
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
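
Conditions can be combined. The sketch below assumes the same three-person table under the name `people`. Each condition needs its own parentheses, with `&` for AND and `|` for OR.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# AND: older than 25 and living in Chicago
older_in_chicago = people[(people['Age'] > 25) & (people['City'] == 'Chicago')]

# OR: older than 30 or living in New York
older_or_ny = people[(people['Age'] > 30) | (people['City'] == 'New York')]

# isin: membership test against a list of values
in_cities = people[people['City'].isin(['Chicago', 'New York'])]

print(older_in_chicago)
print(older_or_ny)
print(in_cities)
```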


 - MODIFYING DATA :

In [21]:
# Add a new column :
dfe['Salary'] = [50000, 60000, 55000]
# Modify existing data :
dfe['Age'] = dfe['Age'] + 1  # Increase all ages by 1
# Drop a column :
dfe = dfe.drop('Salary', axis=1)  # Remove the 'Salary' column
print(dfe)

      Name  Age         City
0    Alice   26     New York
1      Bob   31  Los Angeles
2  Charlie   36      Chicago
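
A few related modifications, sketched under the assumption of the same table stored as `people`: updating only the rows that match a condition, dropping a column by name, and renaming a column. The replacement city and the new column name are made up.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# .loc with a boolean mask updates only the matching rows
people.loc[people['Name'] == 'Bob', 'City'] = 'San Francisco'

# drop(columns=...) does the same job as drop(..., axis=1)
without_city = people.drop(columns=['City'])

# rename returns a new DataFrame and leaves the original unchanged
renamed = people.rename(columns={'Age': 'AgeYears'})

print(people)
print(without_city.columns.tolist())
print(renamed.columns.tolist())
```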


DATA HANDLING WITH PANDAS :
 -

Data handling with Pandas focuses on tasks such as reading data from files, handling missing data, and transforming data. The program below also cleans and preprocesses the data by filling in missing values, removing duplicates, and converting data types.

Reading Data from a CSV File :
 - The program reads data from a CSV file ('info.csv') into a Pandas DataFrame named d using 'pd.read_csv()'.
 - The 'print(d)' statement displays the original DataFrame.

Handling Missing Data :
 - The program counts the missing values in each column using 'd.isnull().sum()'.
 - Missing values in the Age column are filled with the mean age using 'fillna()'.
 - Missing values in the Salary column are filled with the median salary.

Removing Duplicates :
 - The program removes duplicate rows from the DataFrame using 'drop_duplicates()'.
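
By default, 'drop_duplicates()' removes only rows that match in every column. The sketch below uses made-up rows to show the difference between that default and deduplicating on a subset of columns; the name `staff` is an assumption.

```python
import pandas as pd

# Hypothetical rows: the first two are fully identical, the third differs in Salary
staff = pd.DataFrame({
    'Name': ['Bob', 'Bob', 'Bob', 'Emma'],
    'Salary': [62500.0, 62500.0, 65000.0, 55000.0],
})

exact = staff.drop_duplicates()                   # drops only the fully identical row
by_name = staff.drop_duplicates(subset=['Name'])  # keeps the first row per Name
by_name_last = staff.drop_duplicates(subset=['Name'], keep='last')

print(len(exact))                       # 3
print(by_name['Salary'].tolist())       # [62500.0, 55000.0]
print(by_name_last['Salary'].tolist())  # [65000.0, 55000.0]
```

Two rows that share a name but differ in any other column both survive a plain 'drop_duplicates()' call.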

Data Type Conversions :
 - The Age column is converted to integers using 'astype(int)'.
 - The Salary column is converted to floats using 'astype(float)'.
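
'astype(int)' works only when the column has no missing values left, which is why the missing ages are filled first. As a hedged alternative for columns that still contain gaps or unparseable text, the sketch below uses `pd.to_numeric` with `errors='coerce'` and the nullable 'Int64' dtype. The sample values are made up.

```python
import pandas as pd

# Hypothetical raw column with one unparseable entry
raw = pd.Series(['25', '30', 'unknown', '28'])

# errors='coerce' turns unparseable entries into NaN instead of raising
ages = pd.to_numeric(raw, errors='coerce')

# The nullable 'Int64' dtype keeps the gap as <NA>; plain astype(int) would raise here
ages_int = ages.astype('Int64')

print(ages)
print(ages_int)
```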

Data Transformation :
 - A new column, Salary in Thousands, is created by dividing the Salary column by 1000.

Saving the Cleaned Data :
 - The cleaned DataFrame is saved to a new CSV file ('cleaned_info.csv') using 'to_csv()'. Passing 'index=False' leaves the row index out of the file.

In [33]:
# Step 1: Read the CSV file into a DataFrame
d = pd.read_csv('info.csv')

# Display the DataFrame
print("Original DataFrame:")
print(d)

# Step 2: Handling Missing Data

# Check for missing values
print("\nMissing Values:")
print(d.isnull().sum())

# Fill missing values in the 'Age' column with the mean age.
# Assigning the result back avoids calling an inplace method on a column selection.
d['Age'] = d['Age'].fillna(d['Age'].mean())

# Fill missing values in the 'Salary' column with the median salary
d['Salary'] = d['Salary'].fillna(d['Salary'].median())

print("\nDataFrame after handling missing values:")
print(d)

# Step 3: Removing Duplicates

# Remove duplicate rows
d.drop_duplicates(inplace=True)

print("\nDataFrame after removing duplicates:")
print(d)

# Step 4: Data Type Conversions

# Convert the 'Age' column to integers
d['Age'] = d['Age'].astype(int)

# Convert the 'Salary' column to floats
d['Salary'] = d['Salary'].astype(float)

print("\nDataFrame after data type conversions:")
print(d)

# Step 5: Data Transformation

# Create a new column 'Salary in Thousands' by dividing the 'Salary' column by 1000
d['Salary in Thousands'] = d['Salary'] / 1000

print("\nDataFrame after data transformation:")
print(d)

# Step 6: Save the cleaned DataFrame to a new CSV file
d.to_csv('cleaned_info.csv', index=False)
print("\nCleaned data saved to 'cleaned_info.csv'")

Original DataFrame:
      Name   Age   Salary Department
0    Alice  25.0  50000.0         HR
1      Bob  30.0      NaN    Finance
2  Charlie  35.0  70000.0         IT
3    David   NaN  60000.0  Marketing
4     Emma  28.0  55000.0         HR
5      Bob  30.0  65000.0    Finance
6    Frank  40.0  80000.0         IT

Missing Values:
Name          0
Age           1
Salary        1
Department    0
dtype: int64

DataFrame after handling missing values:
      Name        Age   Salary Department
0    Alice  25.000000  50000.0         HR
1      Bob  30.000000  62500.0    Finance
2  Charlie  35.000000  70000.0         IT
3    David  31.333333  60000.0  Marketing
4     Emma  28.000000  55000.0         HR
5      Bob  30.000000  65000.0    Finance
6    Frank  40.000000  80000.0         IT

DataFrame after removing duplicates:
      Name        Age   Salary Department
0    Alice  25.000000  50000.0         HR
1      Bob  30.000000  62500.0    Finance
2  Charlie  35.000000  70000.0         IT
3    D

Note: in recent pandas 2.x releases, calling an inplace method on a selected column, such as d['Age'].fillna(d['Age'].mean(), inplace=True) or d['Salary'].fillna(d['Salary'].median(), inplace=True), emits a FutureWarning. The behavior will change in pandas 3.0: the intermediate object on which the values are set always behaves as a copy, so the inplace method will never work.

Instead of 'df[col].method(value, inplace=True)', use 'df.method({col: value}, inplace=True)' or 'df[col] = df[col].method(value)' to perform the operation on the original object.
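
The two alternatives named in the warning text above can be sketched as follows. The small DataFrame is made up for illustration and is not the contents of 'info.csv'.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column
d = pd.DataFrame({
    'Age': [25.0, 30.0, np.nan, 28.0],
    'Salary': [50000.0, np.nan, 60000.0, 55000.0],
})

# Option 1: assign the filled column back to the DataFrame
by_column = d.copy()
by_column['Age'] = by_column['Age'].fillna(by_column['Age'].mean())

# Option 2: one call on the DataFrame with a column -> fill value mapping
filled = d.fillna({'Age': d['Age'].mean(), 'Salary': d['Salary'].median()})

print(filled)
print(filled.isnull().sum())
```

Neither option modifies `d` itself here, so the original data stays available for comparison.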


DATA ANALYSIS WITH PANDAS :
 -

Data analysis with Pandas focuses on generating summary statistics, grouping data, and applying aggregate functions. It also covers advanced data manipulation techniques such as merging, joining, and concatenating DataFrames.

1. Creating a DataFrame :
A DataFrame df is created from sample data containing employee names, departments, ages, and salaries, in the same way as in the sections above.
2. Generating Summary Statistics :
The describe() function generates summary statistics for numeric columns, such as count, mean, standard deviation, min, and max.

3. Grouping Data and Applying Aggregate Functions :
 - The groupby() function groups the data by the 'Department' column and applies aggregate functions (e.g., mean, sum) to the 'Salary' column.
 - Multiple aggregate functions are applied to both 'Salary' and 'Age' columns.
4. Advanced Data Manipulation:
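 - Merging DataFrames: Combines df and df2 on the 'Name' column. pd.merge() performs an inner join by default, so only names present in both DataFrames are kept.
 - Joining DataFrames: Sets 'Name' as the index of df2, then joins it to df on the 'Name' column. join() performs a left join by default, so every row of df is kept.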
 - Concatenating DataFrames: Stacks df and df3 vertically.
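
Applying several aggregate functions to several columns produces two-level column headers. One possible alternative is named aggregation, which gives each result column a flat name. The sketch below uses a shortened, assumed version of the employee data, and the output column names are made up.

```python
import pandas as pd

# Shortened sample data, assumed for illustration
emp = pd.DataFrame({
    'Department': ['HR', 'Finance', 'IT', 'HR', 'IT', 'Finance'],
    'Age': [25, 30, 35, 28, 40, 30],
    'Salary': [50000, 60000, 70000, 55000, 80000, 62000],
})

# Named aggregation: result_column=(source_column, function)
summary = emp.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
    oldest=('Age', 'max'),
)
print(summary)
```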

In [35]:
# 1: Sample Data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace'],
    'Department': ['HR', 'Finance', 'IT', 'Marketing', 'HR', 'IT', 'Finance'],
    'Age': [25, 30, 35, 40, 28, 40, 30],
    'Salary': [50000, 60000, 70000, 60000, 55000, 80000, 62000]
}
df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# 2: Generate Summary Statistics
print("\nSummary Statistics:")
print(df.describe())

# 3: Grouping Data and Applying Aggregate Functions
# Group by 'Department' and calculate the mean and sum of 'Salary'
grouped = df.groupby('Department')['Salary'].agg(['mean', 'sum'])
print("\nGrouped Data (Mean and Sum of Salary by Department):")
print(grouped)

# Group by 'Department' and calculate multiple aggregates for 'Salary' and 'Age'
multi_grouped = df.groupby('Department').agg({
    'Salary': ['mean', 'sum', 'max'],
    'Age': ['mean', 'min', 'max']
})
print("\nMultiple Aggregates for 'Salary' and 'Age' by Department:")
print(multi_grouped)

# 4: Advanced Data Manipulation
# Merging, Joining, and Concatenating DataFrames

# Creating a second DataFrame for demonstration
data2 = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Grace', 'Henry'],
    'Bonus': [5000, 6000, 7000, 6000, 5500, 6200, 5800]
}
df2 = pd.DataFrame(data2)

# Merging DataFrames on the 'Name' column
merged_df = pd.merge(df, df2, on='Name')
print("\nMerged DataFrame (on 'Name'):")
print(merged_df)

# Joining DataFrames
# df2 holds additional bonus data; use 'Name' as its index
df2.set_index('Name', inplace=True)
joined_df = df.join(df2, on='Name')
print("\nJoined DataFrame (on 'Name'):")
print(joined_df)

# Concatenating DataFrames (stacking them vertically)
data3 = {
    'Name': ['Isaac', 'James'],
    'Department': ['IT', 'Finance'],
    'Age': [29, 31],
    'Salary': [68000, 64000]
}
df3 = pd.DataFrame(data3)
concatenated_df = pd.concat([df, df3], ignore_index=True)
print("\nConcatenated DataFrame (df + df3):")
print(concatenated_df)

Original DataFrame:
      Name Department  Age  Salary
0    Alice         HR   25   50000
1      Bob    Finance   30   60000
2  Charlie         IT   35   70000
3    David  Marketing   40   60000
4     Emma         HR   28   55000
5    Frank         IT   40   80000
6    Grace    Finance   30   62000

Summary Statistics:
             Age        Salary
count   7.000000      7.000000
mean   32.571429  62428.571429
std     5.883795   9897.089519
min    25.000000  50000.000000
25%    29.000000  57500.000000
50%    30.000000  60000.000000
75%    37.500000  66000.000000
max    40.000000  80000.000000

Grouped Data (Mean and Sum of Salary by Department):
               mean     sum
Department                 
Finance     61000.0  122000
HR          52500.0  105000
IT          75000.0  150000
Marketing   60000.0   60000

Multiple Aggregates for 'Salary' and 'Age' by Department:
             Salary                  Age        
               mean     sum    max  mean min max
Department           
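
In the program above, 'Frank' appears only in df and 'Henry' only in df2, so the kind of join decides which rows survive. The sketch below uses two small made-up tables to compare the default inner merge with left and outer merges. `indicator=True` adds a '_merge' column showing where each row came from.

```python
import pandas as pd

# Hypothetical tables: 'Frank' is only on the left, 'Henry' only on the right
left = pd.DataFrame({'Name': ['Alice', 'Frank', 'Grace'],
                     'Salary': [50000, 80000, 62000]})
right = pd.DataFrame({'Name': ['Alice', 'Grace', 'Henry'],
                      'Bonus': [5000, 6200, 5800]})

inner = pd.merge(left, right, on='Name')                # default how='inner'
left_join = pd.merge(left, right, on='Name', how='left')
outer = pd.merge(left, right, on='Name', how='outer', indicator=True)

print(inner)      # only names present in both tables
print(left_join)  # every left row; Bonus is NaN where there is no match
print(outer)      # all names from both tables
```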

APPLICATION IN DATA SCIENCE :
 -

Pandas is used to perform a variety of essential data handling and analysis tasks, showcasing its ability to efficiently manipulate and analyze structured data. The following points highlight how the use of Pandas can significantly benefit a data science professional:

- ADVANTAGES OF USING PANDAS :
 1. Efficient Data Handling :
  - DataFrames and Series : Pandas' primary data structures, DataFrames and Series, are optimized for handling large datasets efficiently. They provide powerful tools for data manipulation, such as indexing, slicing, filtering, and reshaping, which are not as straightforward or efficient with traditional Python data structures like lists and dictionaries.
 2. Ease of Data Cleaning :
  - Handling Missing Data : Pandas offers simple and intuitive methods for handling missing data, such as fillna() and dropna(). These functions make it easy to prepare data for analysis by filling in or removing incomplete data.
  - Data Type Conversions : Converting data types is seamless with Pandas, ensuring that the data is in the correct format for analysis. Functions like astype() allow for easy conversions that are critical in data preprocessing.
 3. Advanced Data Manipulation :
  - Merging, Joining, and Concatenating : Pandas provides robust functions to merge, join, and concatenate datasets, enabling data scientists to combine multiple sources of data with ease. These operations are crucial when dealing with real-world datasets that often come from various sources and need to be integrated.
 4. Comprehensive Data Analysis :
  - Summary Statistics and Aggregation : Pandas simplifies the process of generating summary statistics and performing group-based aggregation, allowing data scientists to quickly gain insights into their data. Functions like groupby() and agg() facilitate complex analyses that would be cumbersome with basic Python structures.
  - Exploratory Data Analysis (EDA) : Pandas is indispensable for EDA, providing tools to quickly explore datasets, understand distributions, and identify trends. The ability to easily slice and dice the data enables a data scientist to uncover patterns and relationships that inform further analysis.
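
The 'dropna()' method mentioned under "Handling Missing Data" removes incomplete rows rather than filling them. A minimal sketch with made-up records:

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps in 'Age' and 'Salary'
records = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'David'],
    'Age': [25.0, 30.0, np.nan],
    'Salary': [50000.0, np.nan, 60000.0],
})

complete_rows = records.dropna()            # keep rows with no missing values at all
has_age = records.dropna(subset=['Age'])    # only require 'Age' to be present

print(complete_rows)
print(has_age)
```

Whether to drop or fill depends on how much data can be lost without distorting the analysis.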
    
- REAL-WORLD EXAMPLES OF PANDAS IN ACTION :
 1. Data Cleaning :
In the financial industry, datasets often come with missing values, duplicates, or incorrect formats. Pandas is essential for cleaning these datasets, ensuring that they are ready for accurate analysis and modeling.
 2. Exploratory Data Analysis (EDA) :
In marketing, EDA is used to understand customer behavior and segment markets. Pandas allows for quick aggregation and visualization of customer data, helping marketers identify key customer segments and tailor campaigns accordingly.
 3. Merging Diverse Datasets :
In scientific research, data often comes from multiple experiments or studies. Pandas enables researchers to merge these datasets efficiently, ensuring that all relevant information is combined into a cohesive dataset for analysis.

- CONCLUSION :

Pandas stands out as a vital tool for data science professionals due to its ability to handle complex data operations with ease and efficiency. Compared to traditional Python data structures, Pandas offers a more powerful and flexible approach to data manipulation and analysis. Whether it's cleaning data, performing exploratory data analysis, or merging datasets, Pandas empowers data scientists to work more effectively, leading to faster and more accurate insights. This efficiency is crucial in real-world applications where time and accuracy are paramount.
