# GETTING FAMILIAR WITH PANDAS

Pandas is an open-source data manipulation and analysis library for Python. It provides high-performance, easy-to-use data structures and data analysis tools. The two primary data structures in Pandas are Series (for one-dimensional data) and DataFrame (for two-dimensional data).

### Key Features of Pandas:

#### Data Structures:

##### 1. Series: 
A one-dimensional labeled array that can hold data of any type (integers, strings, floats, etc.).
##### 2. DataFrame: 
A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It's similar to a table in a relational database or an Excel spreadsheet.

#### Data Handling:

##### 1. Loading Data: 
You can easily load data from different file formats such as CSV, Excel, SQL databases, JSON, etc.
##### 2. Data Cleaning: 
Pandas provides powerful tools to handle missing data, duplicate data, and incorrect formats, making it easier to clean and prepare data for analysis.
##### 3. Data Transformation: 
With Pandas, you can manipulate data by merging, reshaping, slicing, and dicing the DataFrame to fit your needs.
##### 4. Data Aggregation: 
You can perform operations like grouping, summarizing, and aggregating data to gain insights.

#### Time Series:
Pandas has strong support for working with time series data, including functionality for date-based indexing, resampling, and time-based operations.
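
As a minimal sketch of the date-based indexing and resampling mentioned above (the dates and readings are made up purely for illustration):

```python
import pandas as pd

# Hypothetical daily readings indexed by date
dates = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([10, 12, 11, 15, 14, 13], index=dates)

# Date-based indexing: select a range of days by label (both ends included)
first_three = readings.loc["2024-01-01":"2024-01-03"]

# Resampling: aggregate the daily values into 2-day sums
two_day_totals = readings.resample("2D").sum()
print(two_day_totals)
```

Here `resample("2D")` groups the six daily values into three two-day bins before summing each bin.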

#### Integration:
Pandas integrates well with other data science libraries in Python, such as NumPy for numerical operations and Matplotlib or Seaborn for data visualization.
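
A small sketch of the NumPy side of this integration, using made-up values: NumPy functions can usually be applied directly to a Series, and `to_numpy()` hands the values over as a NumPy array.

```python
import numpy as np
import pandas as pd

scores = pd.Series([4.0, 9.0, 16.0], index=["a", "b", "c"])

# NumPy functions apply element-wise and keep the pandas index
roots = np.sqrt(scores)

# to_numpy() exposes the values as a plain NumPy array
values = scores.to_numpy()
print(roots)
print(values.mean())
```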

### Installing the Pandas package

To install the Pandas package, you can use Python's package management tools. Pandas is available on the Python Package Index (PyPI), so it can be installed using pip or through a package manager like conda if you're using Anaconda.

In [58]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


### 1. Understanding Series

##### Series:
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, etc.). The labels in a Series are referred to as the index.
Think of a Series as a column in an Excel sheet.

In [59]:
import pandas

In [60]:
#creating Series from a list
data = [10, 20, 30, 40]
series = pandas.Series(data)
print(series)

0    10
1    20
2    30
3    40
dtype: int64


In [61]:
#creating Series from a Dictionary
data={'a':10,'b':20,'c':30,'d':40,'e':50}
series=pandas.Series(data)
print(series)

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [62]:
#creating Series from scalar
series=pandas.Series(5,index=['a','b','c','d'])
print(series)

a    5
b    5
c    5
d    5
dtype: int64
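
Once a Series exists, its index labels can be used to look values up. The following is a minimal sketch (the variable names are only illustrative) of label-based access, position-based access, and element-wise arithmetic:

```python
import pandas as pd

series = pd.Series({'a': 10, 'b': 20, 'c': 30})

by_label = series.loc['b']      # label-based lookup
by_position = series.iloc[0]    # position-based lookup
doubled = series * 2            # arithmetic applies to every element

print(by_label, by_position)
print(doubled)
```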


### 2. Understanding DataFrames

 ##### DataFrame:
  A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a table in a database or an Excel spreadsheet.
  Each column in a DataFrame is a Series.

In [63]:
#creating a DataFrame from a list of lists
data = [[1, 'Alice'], [2, 'Bob'], [3, 'Charlie']]
df = pandas.DataFrame(data, columns=['ID', 'Name'])
print(df)

   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie


In [64]:
#creating DataFrames from a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pandas.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [65]:
#creating DataFrames from a dictionary of dictionaries
data = {
    'row1': {'ID': 1, 'Name': 'Alice'},
    'row2': {'ID': 2, 'Name': 'Bob'},
    'row3': {'ID': 3, 'Name': 'Charlie'}
}
df = pandas.DataFrame(data).T
print(df)

     ID     Name
row1  1    Alice
row2  2      Bob
row3  3  Charlie
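
Transposing works here, but because each original column mixes integers and strings, the transposed columns typically end up with the generic `object` dtype. One possible alternative, sketched below, is `DataFrame.from_dict` with `orient='index'`, which treats the outer keys as row labels directly and should let `ID` keep an integer dtype:

```python
import pandas as pd

data = {
    'row1': {'ID': 1, 'Name': 'Alice'},
    'row2': {'ID': 2, 'Name': 'Bob'},
    'row3': {'ID': 3, 'Name': 'Charlie'}
}

# orient='index' treats the outer keys as row labels, so no transpose is needed
df = pd.DataFrame.from_dict(data, orient='index')
print(df)
print(df.dtypes)
```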


In [66]:
#Reading from csv file
df=pandas.read_csv("data_file.csv")
print(df)

  name   age  branch
0    X    19     CSD
1    Y    20     CSD
2    Z    19     CSD
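
If `data_file.csv` is not at hand, the same call can be tried on in-memory text. In this sketch the CSV content is an assumption that simply mirrors the output above, and `io.StringIO` stands in for the file:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data_file.csv
csv_text = "name,age,branch\nX,19,CSD\nY,20,CSD\nZ,19,CSD\n"

df = pd.read_csv(io.StringIO(csv_text))

# Write the DataFrame back to CSV text; index=False leaves out the row labels
out = df.to_csv(index=False)
print(out)
```

Passing a file path to `to_csv` instead of omitting it would write the same text to disk.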


# Writing a Program in Python to Handle and Analyze Data using Pandas

### Data Handling and Analysis:

Pandas provides a powerful and flexible set of tools for handling and analyzing data. Its core data structures (Series and DataFrame) and a comprehensive set of functions allow users to handle data efficiently. Understanding these concepts and techniques will enable effective data wrangling, cleaning, and analysis in various data science and data analysis tasks.

### Key Features:

##### 1. Data Structures: 
Pandas provides two primary data structures:
##### Series: 
A one-dimensional labeled array of values.

##### DataFrame: 
A two-dimensional labeled data structure with columns of potentially different types.

##### 2. Data Operations: 
Pandas supports various data operations, including:
##### Filtering: 
Selecting specific rows or columns based on conditions.
##### Sorting: 
Sorting data by one or more columns.
##### Grouping: 
Grouping data by one or more columns and performing aggregation operations.
##### Merging: 
Combining data from multiple DataFrames.
##### Reshaping: 
Transforming data from wide to long format or vice versa.
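
Filtering and reshaping from the list above can be sketched on a tiny, made-up table as follows (the column names `Subject` and `Marks` are chosen only for this illustration):

```python
import pandas as pd

wide = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Math': [85, 76],
    'Science': [92, 85],
})

# Filtering: keep only the rows where the condition is True
high_math = wide[wide['Math'] > 80]

# Reshaping wide -> long: one row per (Name, Subject) pair
long_df = wide.melt(id_vars='Name', var_name='Subject', value_name='Marks')

# Reshaping long -> wide again
back = long_df.pivot(index='Name', columns='Subject', values='Marks')

print(high_math)
print(long_df)
print(back)
```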

##### 3. Data Input/Output: 
Pandas supports reading and writing data from various file formats, including:
##### CSV: 
Comma-separated values.
##### Excel: 
Microsoft Excel files.
##### JSON: 
JavaScript Object Notation.
##### SQL: 
Structured Query Language databases.

Let us consider the following program, which shows how to handle and analyze data with Pandas using a dataset of student marks.

This program demonstrates data handling using Pandas, focusing on building a sample DataFrame, handling missing data, transforming data, and performing data analysis, including summary statistics, grouping, and advanced data manipulation techniques like merging and joining DataFrames.


In [100]:
#importing the pandas package as pd for convenience
import pandas as pd

In [104]:
# Create a sample dataset
data = {
    'Roll No': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward', 'Frank', 'Grace', 'Hannah', 'Ivy', 'Jack'],
    'Math': [85, 76, 89, None, 91, 78, 88, 95, 70, 84],
    'Science': [92, 85, 94, 70, 85, None, 90, 98, 75, 88],
    'English': [78, 80, 92, 72, 87, 84, None, 99, 73, 85],
    'History': [90, 88, 95, 68, 93, 79, 92, 100, 78, 87],
    'Art': [89, 92, 91, 74, 90, 85, 87, 97, 72, 90]
}

In [105]:
# Create a DataFrame from the data
df = pd.DataFrame(data)

In [106]:
print("Original DataFrame:")
print(df)

Original DataFrame:
   Roll No     Name  Math  Science  English  History  Art
0      101    Alice  85.0     92.0     78.0       90   89
1      102      Bob  76.0     85.0     80.0       88   92
2      103  Charlie  89.0     94.0     92.0       95   91
3      104    David   NaN     70.0     72.0       68   74
4      105   Edward  91.0     85.0     87.0       93   90
5      106    Frank  78.0      NaN     84.0       79   85
6      107    Grace  88.0     90.0      NaN       92   87
7      108   Hannah  95.0     98.0     99.0      100   97
8      109      Ivy  70.0     75.0     73.0       78   72
9      110     Jack  84.0     88.0     85.0       87   90


#### Data Handling

##### 1. Handling Missing Data

In [107]:
# a. Identifying Missing Data
print("\nMissing Data Information:")
print(df.isnull().sum())


Missing Data Information:
Roll No    0
Name       0
Math       1
Science    1
English    1
History    0
Art        0
dtype: int64


In [108]:
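#converting the string column 'Name' to integer codes so that df.mean() in the next cell
#can be computed over every column (note: this replaces the names with codes)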
df['Name']=pd.factorize(df['Name'])[0]
print(df['Name'])

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
Name: Name, dtype: int64


In [109]:
# b. Filling Missing Values
# Fill missing values in numeric columns with the mean of the respective column
df.fillna(df.mean(), inplace=True)
print("\nDataFrame After Filling Missing Data with Mean:")
print(df)


DataFrame After Filling Missing Data with Mean:
   Roll No  Name  Math    Science    English  History  Art
0      101     0  85.0  92.000000  78.000000       90   89
1      102     1  76.0  85.000000  80.000000       88   92
2      103     2  89.0  94.000000  92.000000       95   91
3      104     3  84.0  70.000000  72.000000       68   74
4      105     4  91.0  85.000000  87.000000       93   90
5      106     5  78.0  86.333333  84.000000       79   85
6      107     6  88.0  90.000000  83.333333       92   87
7      108     7  95.0  98.000000  99.000000      100   97
8      109     8  70.0  75.000000  73.000000       78   72
9      110     9  84.0  88.000000  85.000000       87   90
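
The cell above relies on `Name` having been converted to integer codes first. A possible alternative, sketched here on a small made-up subset, is to compute the means with `numeric_only=True` so that the string column is skipped and the names stay readable:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, None, 89],
    'Science': [92, 85, None],
})

# numeric_only=True skips the string column, so 'Name' can stay as text
column_means = df.mean(numeric_only=True)

# fillna with a Series fills each column using the value under its own label
filled = df.fillna(column_means)
print(filled)
```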


##### 2. Removing Duplicates

In [110]:
# (In this example, there are no duplicates, but the function is shown for demonstration)
df.drop_duplicates(inplace=True)

print("\nDataFrame After Removing Duplicates:")
print(df)


DataFrame After Removing Duplicates:
   Roll No  Name  Math    Science    English  History  Art
0      101     0  85.0  92.000000  78.000000       90   89
1      102     1  76.0  85.000000  80.000000       88   92
2      103     2  89.0  94.000000  92.000000       95   91
3      104     3  84.0  70.000000  72.000000       68   74
4      105     4  91.0  85.000000  87.000000       93   90
5      106     5  78.0  86.333333  84.000000       79   85
6      107     6  88.0  90.000000  83.333333       92   87
7      108     7  95.0  98.000000  99.000000      100   97
8      109     8  70.0  75.000000  73.000000       78   72
9      110     9  84.0  88.000000  85.000000       87   90
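
Since the marks dataset has no repeated rows, the effect is easier to see on a small made-up table that does contain one. This sketch uses `duplicated()` to flag the repeat before `drop_duplicates()` removes it:

```python
import pandas as pd

df = pd.DataFrame({
    'Roll No': [101, 102, 102, 103],
    'Math': [85, 76, 76, 89],
})

# duplicated() marks rows that repeat an earlier row
mask = df.duplicated()

# drop_duplicates() keeps the first occurrence by default
deduped = df.drop_duplicates()

print(mask)
print(deduped)
```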


##### 3. Data Type Conversions

In [113]:
# Convert the 'Math', 'Science', and 'English' scores to integers
# (they became floats because of the NaN values); astype(int) drops the fractional part
df['Math'] = df['Math'].astype(int)
df['Science'] = df['Science'].astype(int)
df['English'] = df['English'].astype(int)

print("\nDataFrame After Data Type Conversions:")
print(df.dtypes)
print(df)


DataFrame After Data Type Conversions:
Roll No    int64
Name       int64
Math       int32
Science    int32
English    int32
History    int64
Art        int64
dtype: object
   Roll No  Name  Math  Science  English  History  Art
0      101     0    85       92       78       90   89
1      102     1    76       85       80       88   92
2      103     2    89       94       92       95   91
3      104     3    84       70       72       68   74
4      105     4    91       85       87       93   90
5      106     5    78       86       84       79   85
6      107     6    88       90       83       92   87
7      108     7    95       98       99      100   97
8      109     8    70       75       73       78   72
9      110     9    84       88       85       87   90
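
Note that `astype(int)` truncates rather than rounds, which is why 86.333333 became 86 above. If rounding to the nearest integer is wanted instead, one option is to call `round()` first. A minimal sketch (the value 83.9 is made up to show the difference):

```python
import pandas as pd

scores = pd.Series([86.333333, 83.9])

truncated = scores.astype(int)          # drops the fractional part
rounded = scores.round().astype(int)    # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```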


##### 4. Data Transformation

In [114]:
# Add a new column 'Total Marks' as the sum of marks across subjects
df['Total Marks'] = df[['Math', 'Science', 'English', 'History', 'Art']].sum(axis=1)

print("\nDataFrame After Adding 'Total Marks' Column:")
print(df)


DataFrame After Adding 'Total Marks' Column:
   Roll No  Name  Math  Science  English  History  Art  Total Marks
0      101     0    85       92       78       90   89          434
1      102     1    76       85       80       88   92          421
2      103     2    89       94       92       95   91          461
3      104     3    84       70       72       68   74          368
4      105     4    91       85       87       93   90          446
5      106     5    78       86       84       79   85          412
6      107     6    88       90       83       92   87          440
7      108     7    95       98       99      100   97          489
8      109     8    70       75       73       78   72          368
9      110     9    84       88       85       87   90          434


In [115]:
# Add a new column 'Average Marks' as the mean of marks across subjects
df['Average Marks'] = df[['Math', 'Science', 'English', 'History', 'Art']].mean(axis=1)

print("\nDataFrame After Adding 'Average Marks' Column:")
print(df)


DataFrame After Adding 'Average Marks' Column:
   Roll No  Name  Math  Science  English  History  Art  Total Marks  \
0      101     0    85       92       78       90   89          434   
1      102     1    76       85       80       88   92          421   
2      103     2    89       94       92       95   91          461   
3      104     3    84       70       72       68   74          368   
4      105     4    91       85       87       93   90          446   
5      106     5    78       86       84       79   85          412   
6      107     6    88       90       83       92   87          440   
7      108     7    95       98       99      100   97          489   
8      109     8    70       75       73       78   72          368   
9      110     9    84       88       85       87   90          434   

   Average Marks  
0           86.8  
1           84.2  
2           92.2  
3           73.6  
4           89.2  
5           82.4  
6           88.0  
7           97.8  

#### Data Analysis

In [116]:
# 1. Generating Summary Statistics
print("\nSummary Statistics for the DataFrame:")
print(df.describe())


Summary Statistics for the DataFrame:
         Roll No      Name       Math    Science    English     History  \
count   10.00000  10.00000  10.000000  10.000000  10.000000   10.000000   
mean   105.50000   4.50000  84.000000  86.300000  83.300000   87.000000   
std      3.02765   3.02765   7.512952   8.446564   8.246885    9.486833   
min    101.00000   0.00000  70.000000  70.000000  72.000000   68.000000   
25%    103.25000   2.25000  79.500000  85.000000  78.500000   81.000000   
50%    105.50000   4.50000  84.500000  87.000000  83.500000   89.000000   
75%    107.75000   6.75000  88.750000  91.500000  86.500000   92.750000   
max    110.00000   9.00000  95.000000  98.000000  99.000000  100.000000   

             Art  Total Marks  Average Marks  
count  10.000000    10.000000      10.000000  
mean   86.700000   427.300000      85.460000  
std     7.888811    37.786094       7.557219  
min    72.000000   368.000000      73.600000  
25%    85.500000   414.250000      82.850000  
50%

In [117]:
# 2. Grouping Data and Applying Aggregate Functions
# Assign each student a category from 'Average Marks' (High, Medium, Low based on thresholds)
def categorize_avg_marks(avg_marks):
    if avg_marks >= 90:
        return 'High'
    elif avg_marks >= 75:
        return 'Medium'
    else:
        return 'Low'

In [122]:
df['Performance Category'] = df['Average Marks'].apply(categorize_avg_marks)
print(df)

   Roll No  Name  Math  Science  English  History  Art  Total Marks  \
0      101     0    85       92       78       90   89          434   
1      102     1    76       85       80       88   92          421   
2      103     2    89       94       92       95   91          461   
3      104     3    84       70       72       68   74          368   
4      105     4    91       85       87       93   90          446   
5      106     5    78       86       84       79   85          412   
6      107     6    88       90       83       92   87          440   
7      108     7    95       98       99      100   97          489   
8      109     8    70       75       73       78   72          368   
9      110     9    84       88       85       87   90          434   

   Average Marks Performance Category  
0           86.8               Medium  
1           84.2               Medium  
2           92.2                 High  
3           73.6                  Low  
4           89.2  
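
With the category column in place, the grouping and aggregation step can be sketched with `groupby()` and `agg()`. The block below is self-contained: it rebuilds only the `Average Marks` column (each value is the corresponding `Total Marks` divided by 5) and repeats the categorizing function from above.

```python
import pandas as pd

def categorize_avg_marks(avg_marks):
    if avg_marks >= 90:
        return 'High'
    elif avg_marks >= 75:
        return 'Medium'
    else:
        return 'Low'

# Average marks for the ten students (Total Marks / 5)
df = pd.DataFrame({
    'Average Marks': [86.8, 84.2, 92.2, 73.6, 89.2, 82.4, 88.0, 97.8, 73.6, 86.8],
})
df['Performance Category'] = df['Average Marks'].apply(categorize_avg_marks)

# Group by category, then count the students and average their marks per group
summary = df.groupby('Performance Category')['Average Marks'].agg(['count', 'mean'])
print(summary)
```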

##### 3. Advanced Data Manipulation

In [132]:
# Sort the DataFrame based on 'Total Marks' in descending order
sorted_df = df.sort_values(by='Total Marks', ascending=False)

print("DataFrame Sorted by Total Marks (Descending):")
print(sorted_df)


DataFrame Sorted by Total Marks (Descending):
   Roll No  Name  Math  Science  English  History  Art  Total Marks  \
7      108     7    95       98       99      100   97          489   
2      103     2    89       94       92       95   91          461   
4      105     4    91       85       87       93   90          446   
6      107     6    88       90       83       92   87          440   
0      101     0    85       92       78       90   89          434   
9      110     9    84       88       85       87   90          434   
1      102     1    76       85       80       88   92          421   
5      106     5    78       86       84       79   85          412   
3      104     3    84       70       72       68   74          368   
8      109     8    70       75       73       78   72          368   

   Average Marks Performance Category  
7           97.8                 High  
2           92.2                 High  
4           89.2               Medium  
6           

In [136]:
#printing the three students with the highest total marks
print(sorted_df.head(3))

   Roll No  Name  Math  Science  English  History  Art  Total Marks  \
7      108     7    95       98       99      100   97          489   
2      103     2    89       94       92       95   91          461   
4      105     4    91       85       87       93   90          446   

   Average Marks Performance Category  
7           97.8                 High  
2           92.2                 High  
4           89.2               Medium  
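
When only the top rows are needed, `nlargest()` is a possible shortcut that avoids sorting the whole DataFrame first. A minimal sketch on a four-row subset of the totals above:

```python
import pandas as pd

df = pd.DataFrame({
    'Roll No': [101, 103, 105, 108],
    'Total Marks': [434, 461, 446, 489],
})

# Pick the three rows with the largest 'Total Marks', highest first
top3 = df.nlargest(3, 'Total Marks')
print(top3)
```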


In [133]:
# Suppose you have another DataFrame with additional student information
extra_data = {
    'Roll No': [101, 102, 103, 110],
    'Parent Name': ['Mr. Smith', 'Mrs. Doe', 'Mr. Johnson', 'Ms. White'],
    'Contact': ['123-456', '789-012', '345-678', '901-234']
}
extra_df = pd.DataFrame(extra_data)
print(extra_df)

   Roll No  Parent Name  Contact
0      101    Mr. Smith  123-456
1      102     Mrs. Doe  789-012
2      103  Mr. Johnson  345-678
3      110    Ms. White  901-234


In [134]:
#  Merging DataFrames on 'Roll No'
merged_df = pd.merge(df, extra_df, on='Roll No', how='left')

print("\nMerged DataFrame with Additional Student Information:")
print(merged_df)


Merged DataFrame with Additional Student Information:
   Roll No  Name  Math  Science  English  History  Art  Total Marks  \
0      101     0    85       92       78       90   89          434   
1      102     1    76       85       80       88   92          421   
2      103     2    89       94       92       95   91          461   
3      104     3    84       70       72       68   74          368   
4      105     4    91       85       87       93   90          446   
5      106     5    78       86       84       79   85          412   
6      107     6    88       90       83       92   87          440   
7      108     7    95       98       99      100   97          489   
8      109     8    70       75       73       78   72          368   
9      110     9    84       88       85       87   90          434   

   Average Marks Performance Category  Parent Name  Contact  
0           86.8               Medium    Mr. Smith  123-456  
1           84.2               Medium  
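
With `how='left'`, students without a matching row in the second DataFrame are kept and their new columns are filled with NaN. To see which rows found no match, `merge()` accepts `indicator=True`. The sketch below uses a reduced, illustrative version of the two tables:

```python
import pandas as pd

students = pd.DataFrame({'Roll No': [101, 102, 104], 'Math': [85, 76, 84]})
parents = pd.DataFrame({
    'Roll No': [101, 102],
    'Parent Name': ['Mr. Smith', 'Mrs. Doe'],
})

# indicator=True adds a '_merge' column describing where each row came from
merged = pd.merge(students, parents, on='Roll No', how='left', indicator=True)

# Rows present only in the left DataFrame had no parent information
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print(unmatched)
```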

In [135]:
# Joining DataFrames (similar to SQL join)
# Another DataFrame simulating exam scores from a different term
term2_data = {
    'Roll No': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Math Term 2': [88, 79, 90, 72, 92, 80, 89, 97, 73, 85],
    'Science Term 2': [91, 87, 95, 74, 86, 81, 91, 99, 77, 89]
}
term2_df = pd.DataFrame(term2_data)

In [131]:
# Join the term2_df with the original df based on 'Roll No'
joined_df = df.set_index('Roll No').join(term2_df.set_index('Roll No'), how='left')

print("\nJoined DataFrame with Term 2 Scores:")
print(joined_df)


Joined DataFrame with Term 2 Scores:
         Name  Math  Science  English  History  Art  Total Marks  \
Roll No                                                            
101         0    85       92       78       90   89          434   
102         1    76       85       80       88   92          421   
103         2    89       94       92       95   91          461   
104         3    84       70       72       68   74          368   
105         4    91       85       87       93   90          446   
106         5    78       86       84       79   85          412   
107         6    88       90       83       92   87          440   
108         7    95       98       99      100   97          489   
109         8    70       75       73       78   72          368   
110         9    84       88       85       87   90          434   

         Average Marks Performance Category  Math Term 2  Science Term 2  
Roll No                                                               
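
Besides merging and joining, DataFrames with the same columns can be stacked with `concat()`. A minimal sketch, assuming a hypothetical second section of students (`section_b` and its values are made up):

```python
import pandas as pd

section_a = pd.DataFrame({'Roll No': [101, 102], 'Math': [85, 76]})
section_b = pd.DataFrame({'Roll No': [201, 202], 'Math': [90, 67]})  # hypothetical

# Stack the rows; ignore_index=True renumbers the result from 0
combined = pd.concat([section_a, section_b], ignore_index=True)
print(combined)
```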

### Conclusion

The use of Pandas in this program demonstrates several key advantages that make it invaluable for data science professionals, particularly in the areas of data handling and analysis.

#### Advantages of Pandas Over Traditional Python Data Structures:

1. **Efficient Data Handling**:
   - **DataFrames and Series**: Pandas provides two primary data structures, DataFrames and Series, which are specifically designed for handling and analyzing structured data. Unlike traditional Python lists or dictionaries, DataFrames allow for easy manipulation of tabular data, including operations on rows and columns.
   - **Vectorized Operations**: Pandas supports vectorized operations, which means that operations can be performed on entire columns or rows at once, significantly speeding up computations compared to iterating through data manually using traditional loops.

2. **Data Cleaning and Preprocessing**:
   - **Missing Data**: Pandas provides convenient methods for handling missing data, such as `fillna()` to fill missing values and `dropna()` to remove them. These methods streamline the process of preparing data for analysis.
   - **Duplicates and Data Type Conversion**: The ability to easily identify and remove duplicates with `drop_duplicates()` and convert data types with `astype()` simplifies data preprocessing tasks that would otherwise require manual handling.

3. **Data Transformation and Aggregation**:
   - **Aggregation Functions**: Pandas includes powerful aggregation functions like `groupby()` and `agg()` that facilitate the summarization of data. For example, you can group data by categories and calculate mean values, which is essential for data analysis and reporting.
   - **Adding and Modifying Columns**: Adding new columns or modifying existing ones based on calculations (e.g., calculating total and average marks) is straightforward with Pandas, enabling quick and flexible data transformations.

4. **Advanced Data Manipulation**:
   - **Merging and Joining**: Pandas offers functionalities like `merge()`, `concat()`, and `join()` to combine multiple DataFrames based on common columns or indices. This is crucial for integrating data from different sources.
   - **Handling Different Data Formats**: Pandas can read and write data in various formats such as CSV, Excel, and SQL databases, making it versatile for different data sources.
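
The point about vectorized operations can be illustrated with a small sketch that computes the same per-student totals twice, once with an explicit loop and once by adding whole columns (the three rows are taken from the marks table):

```python
import pandas as pd

df = pd.DataFrame({'Math': [85, 76, 89], 'Science': [92, 85, 94]})

# Loop version: visit each row explicitly
loop_totals = []
for _, row in df.iterrows():
    loop_totals.append(row['Math'] + row['Science'])

# Vectorized version: add whole columns at once
vector_totals = df['Math'] + df['Science']

print(loop_totals)
print(vector_totals.tolist())
```

Both versions give the same totals; on large DataFrames the vectorized form is generally much faster and shorter to write.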

#### Real-World Examples Where Pandas Is Essential:

1. **Data Cleaning**:
   - **Finance**: In financial data analysis, Pandas is used to clean and preprocess large datasets, such as transaction records, by handling missing values, filtering out outliers, and converting data types for accurate analysis.
   - **Healthcare**: Pandas helps in managing patient records, where missing data might need to be imputed or cleaned, and data needs to be aggregated for reporting and analysis.

2. **Exploratory Data Analysis (EDA)**:
   - **Marketing**: In marketing analytics, Pandas is used to perform EDA by summarizing data, calculating key metrics, and visualizing trends to understand customer behavior and campaign performance.
   - **Scientific Research**: Pandas is widely used in scientific research for analyzing experimental data, performing statistical analysis, and preparing data for visualization.

3. **Data Integration**:
   - **E-commerce**: For e-commerce platforms, Pandas can merge customer data with transaction history from different sources to create comprehensive customer profiles and analyze purchasing patterns.
   - **Education**: In educational institutions, Pandas can combine data from various assessments and demographics to analyze student performance and outcomes.

Overall, Pandas offers a robust and flexible framework for handling, analyzing, and visualizing data efficiently, making it a cornerstone tool for data science professionals.