# Pandas

## 1.Getting familiar with pandas

Pandas is a powerful library in Python that provides flexible and efficient data structures for data manipulation and analysis. The two primary data structures in Pandas are DataFrames and Series. Understanding these structures and how to manipulate them is essential for any data science professional.

In [1]:
#importing pandas package
import pandas as pd

### Understanding dataframes and series

**Series:** A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). It is similar to a column in an Excel spreadsheet or a single column in a database table. Each element in a Series is associated with an index.

**DataFrame:** A DataFrame is a two-dimensional labeled data structure, similar to a table in a database or an Excel spreadsheet. It consists of rows and columns, where each column can be of a different data type.

In [2]:
# Creating a Series from a list
data = [2,14,41,55,24,44]
series = pd.Series(data)
print(series)

0     2
1    14
2    41
3    55
4    24
5    44
dtype: int64
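The default index above is simply 0 to 5. As a minimal sketch of what "each element is associated with an index" means in practice, the snippet below gives a Series custom labels (the labels `'a'`, `'b'`, `'c'` and the name `marks` are made up for illustration) and looks up a value both by label and by position.

```python
import pandas as pd

# Hypothetical labels chosen for illustration; the values reuse the first
# three numbers from the list above.
marks = pd.Series([2, 14, 41], index=['a', 'b', 'c'])

by_label = marks['b']        # label-based lookup
by_position = marks.iloc[1]  # position-based lookup (second element)

print(by_label, by_position)
```

Both lookups return the same element here; they differ only in how the element is addressed.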


In [8]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Indu', 'Bindu', 'Meghana','Vinu'],
    'Age': [56, 60, 19,45],
    'Native': ['Alamanda', 'Srikakulam', 'Vizag','Bheemli']
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Native
0,Indu,56,Alamanda
1,Bindu,60,Srikakulam
2,Meghana,19,Vizag
3,Vinu,45,Bheemli
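Since each column of a DataFrame can hold a different data type, it is often useful to inspect the shape and column types right after creating one. A small sketch, rebuilding the same data under the assumed name `df_example`:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Name': ['Indu', 'Bindu', 'Meghana', 'Vinu'],
    'Age': [56, 60, 19, 45],
    'Native': ['Alamanda', 'Srikakulam', 'Vizag', 'Bheemli'],
})

print(df_example.shape)   # (rows, columns)
print(df_example.dtypes)  # one dtype per column

# 'Age' is stored as an integer column, while the text columns use a
# string/object dtype (the exact name depends on the pandas version).
age_is_integer = pd.api.types.is_integer_dtype(df_example['Age'])
```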


In [7]:
# Creating a DataFrame from a list of lists
data = [['Indu', 56], ['Bindu', 60], ['Meghana', 19],['Vinuthna',45]]
df2 = pd.DataFrame(data, columns=['Name', 'Age'])
df2

Unnamed: 0,Name,Age
0,Indu,56
1,Bindu,60
2,Meghana,19
3,Vinuthna,45


In [12]:
# Reading a DataFrame from a CSV file
df3 = pd.read_csv('student.csv')

In [13]:
df3

Unnamed: 0,Name,Rollno,Age
0,meghana,2,20
1,indu,41,19
2,bindu,55,19


### Common Operations

#### Selecting data

In [14]:
#selecting columns

#single column
df['Name']

0       Indu
1      Bindu
2    Meghana
3       Vinu
Name: Name, dtype: object

In [15]:
#multiple columns
df[['Name','Age']]

Unnamed: 0,Name,Age
0,Indu,56
1,Bindu,60
2,Meghana,19
3,Vinu,45


#### Selecting Rows by Index

In [17]:
#using iloc
#select first row
df.iloc[0]

Name          Indu
Age             56
Native    Alamanda
Name: 0, dtype: object

In [18]:
#multiple rows
df.iloc[0:2]

Unnamed: 0,Name,Age,Native
0,Indu,56,Alamanda
1,Bindu,60,Srikakulam
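`iloc` selects by position. Its label-based counterpart is `loc`, and the two behave differently once the index is not the default 0, 1, 2, and so on. The sketch below uses a hypothetical frame `people` with made-up index labels 10 to 40 to show the difference.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['Indu', 'Bindu', 'Meghana', 'Vinu'], 'Age': [56, 60, 19, 45]},
    index=[10, 20, 30, 40],  # hypothetical non-default labels
)

first_by_position = people.iloc[0]  # first row, whatever its label is
row_by_label = people.loc[20]       # the row whose index label is 20

# Position slices exclude the stop value; label slices include both endpoints.
position_slice = people.iloc[0:2]   # rows at positions 0 and 1
label_slice = people.loc[10:30]     # rows labelled 10, 20 and 30
```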


#### Selecting Rows by Condition

In [19]:
df[df['Age']>40]

Unnamed: 0,Name,Age,Native
0,Indu,56,Alamanda
1,Bindu,60,Srikakulam
3,Vinu,45,Bheemli


### Filtering Rows

#### Using Conditions

In [22]:
filtered_df = df[df['Native'] == 'Srikakulam']
filtered_df

Unnamed: 0,Name,Age,Native
1,Bindu,60,Srikakulam


In [23]:
(df['Age'] > 25) & (df['Native'] == 'Bheemli')

0    False
1    False
2    False
3     True
dtype: bool

In [24]:
df[(df['Age'] > 25) & (df['Native'] == 'Bheemli')]

Unnamed: 0,Name,Age,Native
3,Vinu,45,Bheemli
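Boolean masks can also be combined with `|` (or), negated with `~` (not), or built with `isin` for membership tests. A brief sketch on the same data, rebuilt here as `df_example`; the thresholds and town lists are arbitrary choices for illustration.

```python
import pandas as pd

df_example = pd.DataFrame({
    'Name': ['Indu', 'Bindu', 'Meghana', 'Vinu'],
    'Age': [56, 60, 19, 45],
    'Native': ['Alamanda', 'Srikakulam', 'Vizag', 'Bheemli'],
})

# OR: rows matching either condition (parentheses are required around each)
young_or_alamanda = df_example[
    (df_example['Age'] < 30) | (df_example['Native'] == 'Alamanda')
]

# NOT: rows where the condition is False
not_over_50 = df_example[~(df_example['Age'] > 50)]

# Membership: rows whose 'Native' value is in a given list
in_towns = df_example[df_example['Native'].isin(['Vizag', 'Srikakulam'])]
```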


### Modifying data within dataframes

In [25]:
# Add a new column
df['Salary'] = [50000, 60000, 70000,100000]

In [26]:
df

Unnamed: 0,Name,Age,Native,Salary
0,Indu,56,Alamanda,50000
1,Bindu,60,Srikakulam,60000
2,Meghana,19,Vizag,70000
3,Vinu,45,Bheemli,100000


In [27]:
# Modify an existing column
df['Age'] = df['Age'] + 10
df

Unnamed: 0,Name,Age,Native,Salary
0,Indu,66,Alamanda,50000
1,Bindu,70,Srikakulam,60000
2,Meghana,29,Vizag,70000
3,Vinu,55,Bheemli,100000


In [28]:
# Drop a column
df = df.drop('Salary', axis=1)
df


Unnamed: 0,Name,Age,Native
0,Indu,66,Alamanda
1,Bindu,70,Srikakulam
2,Meghana,29,Vizag
3,Vinu,55,Bheemli


In [29]:
# Drop a row by index
df = df.drop(0, axis=0)
df

Unnamed: 0,Name,Age,Native
1,Bindu,70,Srikakulam
2,Meghana,29,Vizag
3,Vinu,55,Bheemli


## 2.Data Handling with pandas

In [30]:
import numpy as np
#creating a data frame with null values using numpy
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], 
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [32]:
# Check for missing values in the DataFrame
data.isnull()

Unnamed: 0,0,1,2
0,False,False,False
1,False,True,True
2,True,True,True
3,True,False,False


In [35]:
#number of null values
data.isnull().sum()

0    2
1    2
2    2
dtype: int64

#### Filling Missing values

In [36]:
#filling all the missing values with 0
data.fillna(0)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,6.5,3.0


In [39]:
#filling with mean
data.fillna(data.mean())

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,6.5,3.0
2,1.0,6.5,3.0
3,1.0,6.5,3.0


In [41]:
#different fill value for each column
data.fillna({0:0.5,1:0,2:1.2})

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,0.0,1.2
2,0.5,0.0,1.2
3,0.5,6.5,3.0
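Besides constants and column means, missing values can be filled from neighbouring rows. The sketch below rebuilds the same frame and uses `ffill()` to carry the last valid value forward; the name `filled_forward` is only for illustration. It also shows that these fill methods return a new DataFrame and leave the original untouched.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Forward fill: each NaN takes the most recent non-missing value above it.
# A NaN in the very first row would stay NaN, since nothing precedes it.
filled_forward = data.ffill()

missing_before = int(data.isnull().sum().sum())           # original is unchanged
missing_after = int(filled_forward.isnull().sum().sum())
print(missing_before, missing_after)
```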


#### Dropping rows and columns

In [43]:
# Dropping rows where all elements are missing
data.dropna(how='all', inplace=True)
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [44]:
# Dropping rows where at least one element is missing
data.dropna(how='any', inplace=True)
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
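Note that `inplace=True` in the cells above changes `data` permanently. As a hedged sketch of less destructive alternatives, the snippet below rebuilds the original frame and uses the `thresh`, `subset`, and `axis` parameters of `dropna`, each returning a new DataFrame. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Keep rows that have at least 2 non-missing values
at_least_two = data.dropna(thresh=2)

# Only look at column 0 when deciding which rows to drop
col0_present = data.dropna(subset=[0])

# Drop columns (axis=1) that are entirely missing; none are, so all 3 remain
non_empty_cols = data.dropna(axis=1, how='all')
```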


####  Transforming Data

In [51]:
df['salary'] = [50000, 60000, 70000]
df

Unnamed: 0,Name,Age,Native,salary
1,Bindu,70,Srikakulam,50000
2,Meghana,29,Vizag,60000
3,Vinu,55,Bheemli,70000


In [52]:
# Adding a new column 'Annual Salary' calculated from the 'salary' column (assuming 'salary' is monthly)
df['Annual Salary'] = df['salary'] * 12
df

Unnamed: 0,Name,Age,Native,salary,Annual Salary
1,Bindu,70,Srikakulam,50000,600000
2,Meghana,29,Vizag,60000,720000
3,Vinu,55,Bheemli,70000,840000
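Column arithmetic is one way to transform data; `assign`, the `.str` accessor, and `map` cover other common cases. The following is a sketch on a hypothetical frame `staff` that mirrors the rows above. The `band` threshold of 60000 is an arbitrary value picked for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Bindu', 'Meghana', 'Vinu'],
    'salary': [50000, 60000, 70000],
})

# assign returns a new DataFrame with the extra columns added
staff = staff.assign(
    annual_salary=lambda d: d['salary'] * 12,
    name_upper=lambda d: d['Name'].str.upper(),
)

# map applies a Python function to each element of a column
staff['band'] = staff['salary'].map(
    lambda s: 'high' if s >= 60000 else 'standard'  # hypothetical cutoff
)
print(staff)
```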


## 3.Data Analysis with Pandas

In [53]:
# Considering a small part of an IPL dataset
df1 = pd.DataFrame({
    'Player_ID': [1, 2, 3, 4],
    'Player_Name': ['Virat Kohli', 'Rohit Sharma', 'MS Dhoni', 'KL Rahul'],
    'Team': ['RCB', 'MI', 'CSK', 'LSG'],
    'Runs': [670, 540, 450, 600],
    'Matches': [14, 14, 12, 13]
})
df1

Unnamed: 0,Player_ID,Player_Name,Team,Runs,Matches
0,1,Virat Kohli,RCB,670,14
1,2,Rohit Sharma,MI,540,14
2,3,MS Dhoni,CSK,450,12
3,4,KL Rahul,LSG,600,13


In [55]:
df2 = pd.DataFrame({
    'Player_ID': [3, 4, 5, 6],
    'Player_Name': ['MS Dhoni', 'KL Rahul', 'David Warner', 'AB de Villiers'],
    'Team': ['CSK', 'LSG', 'SRH', 'RCB'],
    'Wickets': [5, 0, 3, 2],
    'Matches': [12, 13, 14, 11]
})
df2

Unnamed: 0,Player_ID,Player_Name,Team,Wickets,Matches
0,3,MS Dhoni,CSK,5,12
1,4,KL Rahul,LSG,0,13
2,5,David Warner,SRH,3,14
3,6,AB de Villiers,RCB,2,11


In [57]:
#description of dataset using statistical measures
df1.describe()

Unnamed: 0,Player_ID,Runs,Matches
count,4.0,4.0,4.0
mean,2.5,565.0,13.25
std,1.290994,93.273791,0.957427
min,1.0,450.0,12.0
25%,1.75,517.5,12.75
50%,2.5,570.0,13.5
75%,3.25,617.5,14.0
max,4.0,670.0,14.0


In [60]:
#grouping the data and applying aggregate functions
# Group by 'Team' and calculate the mean runs scored
grouped_runs = df1.groupby('Team')['Runs'].mean()
grouped_runs

Team
CSK    450.0
LSG    600.0
MI     540.0
RCB    670.0
Name: Runs, dtype: float64

In [59]:
# Group by 'Team' and get the count of players in each team
grouped_count = df1.groupby('Team')['Player_Name'].count()
grouped_count

Team
CSK    1
LSG    1
MI     1
RCB    1
Name: Player_Name, dtype: int64
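Because `df1` has exactly one player per team, the grouped results above are the same as the raw values. To show grouping with several rows per group, the sketch below uses a made-up frame `batting` (teams `'A'` and `'B'` and their runs are invented for illustration) and computes several aggregates in one call with named aggregation.

```python
import pandas as pd

# Hypothetical data: two teams with more than one player each
batting = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B', 'B'],
    'Runs': [100, 200, 50, 150, 250],
})

# Each keyword becomes an output column; the value names the aggregation
summary = batting.groupby('Team')['Runs'].agg(
    total='sum',
    average='mean',
    players='count',
)
print(summary)
```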

#### Merging Dataframes

In [61]:
# df1 and df2 share the 'Player_ID' column, so we can merge on it (inner join by default)
merged_df = pd.merge(df1, df2, on='Player_ID', suffixes=('_df1', '_df2'))
merged_df

Unnamed: 0,Player_ID,Player_Name_df1,Team_df1,Runs,Matches_df1,Player_Name_df2,Team_df2,Wickets,Matches_df2
0,3,MS Dhoni,CSK,450,12,MS Dhoni,CSK,5,12
1,4,KL Rahul,LSG,600,13,KL Rahul,LSG,0,13
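The merge above keeps only the players present in both frames, because `pd.merge` defaults to `how='inner'`. A sketch of the other join types follows, using trimmed copies of the two frames under the assumed names `batting` and `bowling`. The `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

batting = pd.DataFrame({'Player_ID': [1, 2, 3, 4],
                        'Runs': [670, 540, 450, 600]})
bowling = pd.DataFrame({'Player_ID': [3, 4, 5, 6],
                        'Wickets': [5, 0, 3, 2]})

# Outer join: keep every Player_ID from either side, NaN where data is missing
outer = pd.merge(batting, bowling, on='Player_ID', how='outer', indicator=True)

# Left join: keep every row of batting, add Wickets where available
left = pd.merge(batting, bowling, on='Player_ID', how='left')

print(outer)
```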


#### Joining dataframes

In [62]:
# Set 'Player_ID' as index for joining
df1.set_index('Player_ID', inplace=True)
df2.set_index('Player_ID', inplace=True)

In [64]:
# Join DataFrames on index (Player_ID)
joined_df = df1.join(df2, lsuffix='_df1', rsuffix='_df2', how='inner')
joined_df

Player_ID,Player_Name_df1,Team_df1,Runs,Matches_df1,Player_Name_df2,Team_df2,Wickets,Matches_df2
3,MS Dhoni,CSK,450,12,MS Dhoni,CSK,5,12
4,KL Rahul,LSG,600,13,KL Rahul,LSG,0,13


#### Concatenating dataframes

In [65]:
# Resetting index to original for concatenation
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

In [68]:
# Concatenate DataFrames vertically
pd.concat([df1, df2], axis=0)

Unnamed: 0,Player_ID,Player_Name,Team,Runs,Matches,Wickets
0,1,Virat Kohli,RCB,670.0,14,
1,2,Rohit Sharma,MI,540.0,14,
2,3,MS Dhoni,CSK,450.0,12,
3,4,KL Rahul,LSG,600.0,13,
0,3,MS Dhoni,CSK,,12,5.0
1,4,KL Rahul,LSG,,13,0.0
2,5,David Warner,SRH,,14,3.0
3,6,AB de Villiers,RCB,,11,2.0
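In the vertical result above, the index labels 0 to 3 appear twice, because `concat` keeps each frame's original index. If unique row labels are needed, `ignore_index=True` renumbers them. A minimal sketch with two small frames (`top` and `bottom` are illustrative names):

```python
import pandas as pd

top = pd.DataFrame({'Player_ID': [1, 2], 'Runs': [670, 540]})
bottom = pd.DataFrame({'Player_ID': [3, 4], 'Wickets': [5, 0]})

# Default: original labels are kept, so 0 and 1 each appear twice
kept = pd.concat([top, bottom], axis=0)

# ignore_index=True: rows are relabelled 0..n-1
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

print(kept.index.tolist(), stacked.index.tolist())
```

Columns that exist in only one of the frames are still filled with NaN for the rows coming from the other frame.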


In [67]:
# Concatenate DataFrames horizontally
pd.concat([df1, df2], axis=1)

Unnamed: 0,Player_ID,Player_Name,Team,Runs,Matches,Player_ID,Player_Name,Team,Wickets,Matches
0,1,Virat Kohli,RCB,670,14,3,MS Dhoni,CSK,5,12
1,2,Rohit Sharma,MI,540,14,4,KL Rahul,LSG,0,13
2,3,MS Dhoni,CSK,450,12,5,David Warner,SRH,3,14
3,4,KL Rahul,LSG,600,13,6,AB de Villiers,RCB,2,11


### Conclusion

The examples above show what Pandas offers for data handling and analysis: creating and selecting data, handling missing values, and grouping and combining datasets. These capabilities make it an indispensable tool for data science professionals. Pandas excels in several key areas compared to traditional Python data structures such as lists, dictionaries, and arrays.

## Advantages of Using Pandas Over Traditional Python Data Structures

**Efficient Data Handling**:
- **DataFrames and Series**: Pandas offers intuitive and flexible data structures that simplify the management and analysis of tabular data, akin to database tables or Excel sheets.
- **Column and Row Operations**: Pandas makes it easy to perform operations on entire columns or rows, simplifying data transformation and aggregation.

**Data Analysis and Aggregation**:
- **Built-in Functions**: Pandas provides a wide range of functions for generating summary statistics, grouping data, and applying aggregate functions with minimal code.
- **Groupby and Aggregation**: Grouping data and performing aggregations is seamless, making it easier to summarize data and extract insights.

**Advanced Data Manipulation**:
- **Merging, Joining, and Concatenating**: Pandas offers powerful tools for combining datasets, enabling data scientists to integrate and analyze data from multiple sources.
- **Data Cleaning and Preprocessing**: Pandas simplifies tasks such as handling missing data, removing duplicates, and converting data types, preparing datasets for analysis.

**Performance and Scalability**:
- **Optimized for Performance**: Built on NumPy, Pandas runs column-wise operations as vectorized array code, which is typically much faster than looping over Python lists or dictionaries.
- **Scalability**: Pandas remains practical as data sizes grow, provided the dataset fits in memory, thanks to its optimized underlying implementations.
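To make the performance point concrete, the sketch below computes the same result twice: once with an explicit Python loop and once with a single vectorized expression. The data is synthetic and no timing is claimed here; the vectorized form is usually faster, but the actual gain depends on the data and the machine.

```python
import pandas as pd

runs = pd.Series(range(1, 100_001))  # synthetic values 1..100000

# Plain Python: visit every element one by one
loop_total = 0
for value in runs.tolist():
    loop_total += value * 2

# Pandas: one expression over the whole column, executed in compiled code
vectorized_total = int((runs * 2).sum())

print(loop_total == vectorized_total)
```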

**Ease of Use and Readability**:
- **Intuitive Syntax**: Pandas is user-friendly, improving productivity and reducing errors.
- **Integration with Other Libraries**: Pandas seamlessly integrates with libraries like Matplotlib and Scikit-learn, making it a versatile tool for data science tasks.

## How Pandas Benefits Data Science Professionals

For data science professionals, Pandas is a game-changer. It allows them to quickly and efficiently perform a wide range of data operations, from loading and cleaning data to complex analyses and transformations. By leveraging Pandas, data scientists can focus more on deriving insights and building models rather than getting bogged down in manual data manipulation.

### Real-World Examples Where Pandas is Essential

1. **Data Cleaning in Financial Analysis**:
   - **Example**: A financial analyst working with stock market data often deals with missing or inconsistent data. Pandas is essential for cleaning this data by filling missing values, removing duplicates, and ensuring that all data points are in a consistent format.
   - **Use Case**: A company might use Pandas to clean daily stock prices from multiple sources, ensuring that the data is accurate and ready for further analysis, such as calculating moving averages or identifying trends.

2. **Exploratory Data Analysis (EDA) in Healthcare**:
   - **Example**: In healthcare, analysts often explore patient data to identify patterns or correlations between different variables (e.g., age, blood pressure, cholesterol levels). Pandas allows for quick and efficient EDA by summarizing data, generating statistics, and visualizing distributions.
   - **Use Case**: A data scientist might use Pandas to explore a dataset containing patient health records, identifying key factors that correlate with the likelihood of developing heart disease, which can then be used to inform preventive care strategies.

3. **Data Integration and Transformation in Marketing**:
   - **Example**: Marketers often need to combine data from various sources, such as customer databases, website analytics, and social media platforms. Pandas is crucial for merging, joining, and transforming these datasets into a unified format for analysis.
   - **Use Case**: A marketing team could use Pandas to merge customer demographic data with purchase history and website activity, enabling them to perform segmentation analysis and tailor marketing campaigns to specific customer groups.

4. **Time Series Analysis in Environmental Science**:
   - **Example**: Environmental scientists often work with time series data, such as temperature records or air quality measurements over time. Pandas excels in handling time series data, allowing for easy resampling, rolling statistics, and trend analysis.
   - **Use Case**: A researcher might use Pandas to analyze a dataset of daily temperature readings over several decades, identifying long-term trends and seasonal patterns that could indicate climate change.

5. **Data Preparation for Machine Learning in Retail**:
   - **Example**: In retail, machine learning models are used for predicting customer behavior, such as forecasting sales or recommending products. Pandas is essential for preparing the data, including feature engineering, normalization, and splitting data into training and testing sets.
   - **Use Case**: A data scientist could use Pandas to preprocess a retail dataset containing customer purchase history, transforming categorical data into numerical features and normalizing the data before feeding it into a machine learning model for predicting future customer behavior.
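The resampling and rolling statistics mentioned in the time series example (item 4) can be sketched as follows. The temperature values are synthetic (simply 0 to 13) and the names `temps`, `weekly_mean`, and `rolling_3` are illustrative, not taken from any real dataset.

```python
import pandas as pd

# 14 consecutive days starting on a Monday, with synthetic readings 0..13
dates = pd.date_range('2024-01-01', periods=14, freq='D')
temps = pd.Series(range(14), index=dates, dtype='float64')

# Resample daily readings into weekly bins (weeks ending on Sunday)
weekly_mean = temps.resample('W').mean()

# 3-day rolling average; the first two entries are NaN (window not yet full)
rolling_3 = temps.rolling(window=3).mean()

print(weekly_mean)
print(rolling_3.head())
```

Both operations rely on the Series having a datetime index, which `pd.date_range` provides here.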

## Summary

Pandas greatly benefits data science professionals by providing powerful, intuitive data structures like DataFrames and Series that simplify the handling and analysis of complex, large datasets. It streamlines essential tasks such as data cleaning, transformation, and aggregation, enabling quick identification and resolution of issues like missing data or inconsistencies. Pandas also integrates seamlessly with other data science tools, making it easier to perform exploratory data analysis, prepare data for machine learning, and generate insights with minimal code. This efficiency and flexibility make Pandas an indispensable tool for data science workflows, enhancing productivity and accuracy.