## Introduction to Pandas

**_Pandas_** is an open-source Python library designed for data manipulation and analysis.
It provides powerful, flexible data structures, primarily the one-dimensional Series and the two-dimensional DataFrame, which make it easy **to work** with structured data such as spreadsheets, SQL tables, or CSV files.

It supports automatic alignment, missing data handling, and rich data manipulation functions.
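To make "automatic alignment" concrete, here is a minimal sketch. The Series names and values are made up for illustration; the point is that arithmetic matches values by label, and labels found in only one operand come back as missing.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data)
q1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
q2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Addition matches values by label, not by position
total = q1 + q2
print(total)

# Labels present in only one Series yield NaN (missing)
print(total.isna().sum())  # 2
```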

##### _Pandas provides a convenient way to ***analyze and clean*** data._

##### _The Pandas library introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy._

Pandas offers a wide range of functions for analyzing, cleaning, exploring, and transforming data. Common tasks include handling missing values, filtering and merging datasets, grouping and summarizing data, and preparing data for visualization or machine learning. It is especially valued in data science for its ability to efficiently process **large datasets** and streamline repetitive data-wrangling tasks.

#### Analogy (Expanded):

- Think of a Pandas DataFrame like an Excel spreadsheet in Python.

- It has rows and columns, labels, and allows you to perform operations like sorting, filtering, and calculations, but with the full power and speed of Python and NumPy behind it.
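The spreadsheet analogy can be sketched in a few lines. The table below is invented for illustration; it shows one calculation, one sort, and one filter, which are roughly the operations one would do by hand in a spreadsheet.

```python
import pandas as pd

# A small, made-up order table
df = pd.DataFrame({
    "item": ["pen", "book", "bag"],
    "price": [2.0, 12.5, 30.0],
    "qty": [10, 3, 1],
})

# Calculation: a new column computed from two existing ones
df["total"] = df["price"] * df["qty"]

# Sorting: largest total first
by_total = df.sort_values("total", ascending=False)

# Filtering: keep rows whose total exceeds 25
large = df[df["total"] > 25]

print(by_total)
print(large)
```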

### What is Pandas Used for?

Pandas is a powerful library generally used for:

- Data Cleaning
- Data Transformation
- Data Analysis
- Machine Learning
- Data Visualization

### Why Use Pandas?

Some of the reasons why we should use Pandas are as follows:

1. Handle Large Data Efficiently

Pandas is designed for handling large datasets. It provides powerful tools that simplify tasks like data filtering, transforming, and merging.

It also provides built-in functions to work with formats like CSV, JSON, TXT, Excel, and SQL databases.

2. Tabular Data Representation

Pandas DataFrames, the primary data structure of Pandas, handle data in tabular format. This allows easy indexing, selecting, replacing, and slicing of data.

3. Data Cleaning and Preprocessing

Data cleaning and preprocessing are essential steps in the data analysis pipeline, and Pandas provides powerful tools to facilitate these tasks. It has methods for handling missing values, removing duplicates, handling outliers, data normalization, etc.
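As a rough sketch of two of these cleaning steps, the example below removes duplicate rows and then fills a missing value. The data is hypothetical, and filling with the column mean is only one possible strategy.

```python
import numpy as np
import pandas as pd

# Made-up data with one duplicated row and missing scores
raw = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben", "Cara"],
    "score": [90.0, np.nan, np.nan, 70.0],
})

# Remove exact duplicate rows (the second "Ben" row)
deduped = raw.drop_duplicates()

# Fill the remaining missing score with the column mean
clean = deduped.fillna({"score": deduped["score"].mean()})
print(clean)
```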

4. Time Series Functionality

Pandas contains an extensive set of tools for working with dates, times, and time-indexed data as it was initially developed for financial modeling.
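A small sketch of the time-series tooling, using invented daily sales figures: a date range serves as the index, rows are selected by date strings, and the data is downsampled to three-day totals.

```python
import pandas as pd

# Six daily observations starting on 1 January 2024 (illustrative values)
days = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.Series([5, 7, 6, 8, 9, 4], index=days)

# Select a date range by label; both end points are included
first_two = sales.loc["2024-01-01":"2024-01-02"]

# Downsample to 3-day totals
three_day = sales.resample("3D").sum()

print(first_two)
print(three_day)
```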

5. Free and Open-Source

Pandas follows the same principles as Python, allowing you to use and distribute Pandas for free, even for commercial use.


#### Import Pandas in Python

We can import Pandas in Python using the import statement.


In [1]:
# This code imports the pandas library into our program with the alias pd.
import pandas as pd

After this import statement, we can use Pandas functions and objects by calling them with pd.

##### **Notes:**

- If we import pandas without an alias using import pandas, we can create a DataFrame using the pandas.DataFrame() function.

- Using the alias pd is a common convention among Python programmers, as it makes it easier and quicker to refer to the pandas library in your code.


#### Main **_Data Structures_** of Pandas

Pandas, a popular Python library for data manipulation and analysis, is built around **two** primary data structures:

- Series and
- DataFrame.

##### **Series**

A Series is a **_one-dimensional_** labeled array capable of holding **data** of **any** type (integer, string, float, etc.).

Each element in a Series has an associated label, called an index, which allows for fast and flexible data access and manipulation.

You can think of a Series as similar to a **_single column_** in a spreadsheet or a database table.

It consists of **_two_** main **components**: the labels and the data.
For example,

            0    'John'
            1    30
            2    6.2
            3    False
            dtype: object

The **labels** are the **index** values assigned to each data point, while the **data** represents the actual **values** stored in the Series.

The **_labels_** in a Pandas Series are index numbers by default. As with DataFrames and arrays, the index numbering of a Series starts **_from_** 0.

Such labels can be used to access a specified value.

**_Note:_** A Pandas Series can store elements of different data types. Every Series has a dtype (data type) that describes its underlying data; when the elements are of mixed types, as in the example above, the dtype is object.
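The following sketch shows how the dtype depends on the values stored. The variable names are illustrative, and the integer width may differ on some platforms, so it is not stated as a fixed result.

```python
import pandas as pd

mixed = pd.Series(["John", 30, 6.2, False])
numbers = pd.Series([1, 2, 3])
decimals = pd.Series([1, 2.5, 3])

print(mixed.dtype)     # object: a generic container for mixed Python values
print(numbers.dtype)   # an integer dtype, typically int64
print(decimals.dtype)  # float64: the integers are upcast to floats
```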

### Creating a Pandas Series

We can create a Series from lists, NumPy arrays, or dictionaries:

#### From a Python list:


In [2]:
import pandas as pd

data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
4    50
dtype: int64


By default, the index is a range starting from 0.

#### From a NumPy array:


In [3]:
import numpy as np

data = np.array(['python', 'php', 'java'])
series = pd.Series(data)
print(series)

0    python
1       php
2      java
dtype: object


#### With a **_custom_** index:


In [None]:
s2 = pd.Series(data=['python', 'php', 'java'], index=['r1', 'r2', 'r3'])
print(s2)

r1    python
r2       php
r3      java
dtype: object


#### From a dictionary:

Notice that the **_keys_** of the dictionary have become the labels.


In [4]:
data = {'a': 100, 'b': 200, 'c': 300}
series = pd.Series(data)
print(series)

a    100
b    200
c    300
dtype: int64


##### Key Attributes of a Series

**index**: The labels of the Series.

**values**: The underlying data as a NumPy array.

**dtype**: The data type of the Series.

**shape**: The shape (number of elements).

**size**: Total number of elements.

**name**: Name of the Series (optional).

**ndim**: Number of dimensions (always 1 for a Series).

###### Example:


In [6]:
print(series.index)   # Index(['a', 'b', 'c'], dtype='object')
print(series.values)  # [100 200 300]
print(series.dtype)   # int64
print(series.shape)   # (3,)
print(series.size)    # 3

Index(['a', 'b', 'c'], dtype='object')
[100 200 300]
int64
(3,)
3
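The name and ndim attributes from the list can be checked the same way. This is a minimal sketch; the Series name "price" is an arbitrary choice for illustration.

```python
import pandas as pd

# name is optional and can be set when the Series is created
prices = pd.Series({"a": 100, "b": 200, "c": 300}, name="price")

print(prices.name)   # price
print(prices.ndim)   # 1
print(prices.shape)  # (3,)
```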


##### Accessing Data in a Series

By position (with the default integer index, label and position coincide; series.iloc[0] is the explicit positional form):


In [9]:
import pandas as pd

data = [10, 20, 30, 40, 50]
series = pd.Series(data)

print(series[0])  # 10

10


By label (if custom index):


In [10]:
s2 = pd.Series(data=['python', 'php', 'java'], index=['r1', 'r2', 'r3'])

print(s2['r1'])   # python

python


Slicing:


In [1]:
import pandas as pd

data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series[1:4])  # 1    20
# 2    30
# 3    40

1    20
2    30
3    40
dtype: int64
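When the index labels are themselves integers in a different order, label and position no longer agree. The sketch below, with made-up values, uses .loc for label-based access and .iloc for position-based access to keep the two apart.

```python
import pandas as pd

# Integer labels deliberately out of positional order
s = pd.Series([10, 20, 30], index=[2, 1, 0])

print(s.loc[0])    # 30: the value whose label is 0
print(s.iloc[0])   # 10: the value at the first position

# Label slices include the end label; positional slices exclude the end
letters = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(letters.loc["a":"c"].tolist())   # [1, 2, 3]
print(letters.iloc[0:2].tolist())      # [1, 2]
```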


##### **DataFrame**

A DataFrame is a **_two-dimensional_**, size-mutable, and potentially **heterogeneous** tabular data structure.

For example,

          Country      Capital      Population
     0    Canada       Ottawa       37742154
     1    Australia    Canberra     25499884
     2    UK           London       67886011
     3    Brazil       Brasília     212559417

Here,

Country, Capital and Population are the **_column names_**.

Each row represents a record, with the index value on the left. The index values are auto-assigned starting from **_0_**.

Each column contains data of the **_same_** type. For instance, Country and Capital contain strings, and Population contains integers.

It consists of an ordered collection of columns, each of which can be a different data type (numeric, string, boolean, etc.).

DataFrames are **_analogous_** to **spreadsheets** in Excel or SQL **tables**, with both row and column indices.
A DataFrame can manage both ordered and unordered datasets in Python.

Each column in a DataFrame is essentially a Series, and the DataFrame organizes these Series into a table-like structure.
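A quick way to see that a column is a Series is to select one and inspect it. The two-row table below reuses values from the example above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Canada", "Australia"],
    "Population": [37742154, 25499884],
})

col = df["Population"]
print(type(col).__name__)          # Series
print(col.name)                    # Population
print(col.index.equals(df.index))  # True: the column shares the row index
```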

#### Key Characteristics of DataFrame

- **Two-dimensional**: Data is organized in rows and columns.

- **Labeled axes**: Both rows (index) and columns can have labels.

- **Heterogeneous data**: Different columns can hold different data types.

- Built on top of NumPy for performance.

- Handles missing data gracefully.

- Supports a wide range of data manipulation and analysis operations.



#### Create a Pandas DataFrame

There are multiple ways to create a DataFrame in Pandas:

- From a Dictionary of Lists
- From a List of Lists
- From a List of Dictionaries
- From a NumPy Array
- From Series
- Reading from CSV or Excel (DataFrame From a File)
- Create an Empty DataFrame


1. From a Dictionary **_of_** Lists


In [None]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'Krish', 'Jack'],
        'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)
print(df)


data = {"name": ["abdi", "chala", "dagoo"],
        "age": [20, 21, 19],
        "city": ["addis", "dire", "harar"]}
df = pd.DataFrame(data, index=["r1", "r2", "r3"])
print(df)

    Name  Age
0    Tom   20
1   Nick   21
2  Krish   19
3   Jack   18
     name  age   city
r1   abdi   20  addis
r2  chala   21   dire
r3  dagoo   19  harar


2. From a List of Lists


In [17]:
data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

   Name  Age
0   Tom   10
1  Nick   15
2  Juli   14


3. From a List of Dictionaries


In [22]:
data = [{'a': 1, 'b': 2}, {'a': 10, 'b': 30}]
df = pd.DataFrame(data)
print(df)

    a   b
0   1   2
1  10  30


4. From a NumPy Array


In [None]:
import numpy as np

# A NumPy array has a single dtype, so the ages are stored as strings here
data = np.array([['Alice', 25], ['Bob', 30]])
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

    Name Age
0  Alice  25
1    Bob  30
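Because a NumPy array holds a single dtype, mixing names and numbers turns every element into text. The sketch below is one possible way to restore a numeric column afterwards, using astype; it is an illustration, not the only approach.

```python
import numpy as np
import pandas as pd

data = np.array([["Alice", 25], ["Bob", 30]])
df = pd.DataFrame(data, columns=["Name", "Age"])

# At this point the ages are text, not numbers
age_before = df.loc[0, "Age"]
print(isinstance(age_before, str))  # True

# Convert the column to integers so arithmetic works
df["Age"] = df["Age"].astype(int)
print(df["Age"].sum())  # 55
```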


5. From Series


In [24]:
courses = pd.Series(['Spark', 'Pandas'])
fees = pd.Series([20000, 25000])
duration = pd.Series(['30days', '40days'])

df = pd.concat({'Courses': courses, 'Course_Fee': fees,
               'Course_Duration': duration}, axis=1)
print(df)

  Courses  Course_Fee Course_Duration
0   Spark       20000          30days
1  Pandas       25000          40days


6. Reading from CSV or Excel

Another common way to create a DataFrame is by **_loading_** data from a file, most often a CSV (**_Comma-Separated Values_**) file. Once loaded, the data can be inspected and analyzed like any other DataFrame.

For example,


In [29]:
df = pd.read_csv('data.csv')  # Read CSV file

# Other readers follow the same pattern (con is a database connection object):
# df = pd.read_excel('data.xlsx')  # Read Excel file
# df = pd.read_json('data.json')  # Read JSON file
# dfs = pd.read_html('data.html')  # Read HTML tables (returns a list of DataFrames)
# df = pd.read_sql('SELECT * FROM table_name', con)  # Read SQL query or table
# df = pd.read_sql_table('table_name', con)  # Read SQL table
# df = pd.read_sql_query('SELECT * FROM table_name', con)  # Read SQL query

In [None]:

import pandas as pd

# load data from a CSV file
df = pd.read_csv('data.csv')

print(df)

           Car       Model  Volume  Weight  CO2
0       Toyoty        Aygo    1000     790   99
1   Mitsubishi  Space Star    1200    1160   95
2        Skoda      Citigo    1000     929   95
3         Fiat         500     900     865   90
4         Mini      Cooper    1500    1140  105
5           VW         Up!    1000     929  105
6        Skoda       Fabia    1400    1109   90
7     Mercedes     A-Class    1500    1365   92
8         Ford      Fiesta    1500    1112   98
9         Audi          A1    1600    1150   99
10     Hyundai         I20    1100     980   99
11      Suzuki       Swift    1300     990  101
12        Ford      Fiesta    1000    1112   99
13       Honda       Civic    1600    1252   94
14      Hundai         I30    1600    1326   97
15        Opel       Astra    1600    1330   97
16         BMW           1    1600    1365   99
17       Mazda           3    2200    1280  104
18       Skoda       Rapid    1600    1119  104
19        Ford       Focus    2000    13


In this example, we used the **_read_csv()_** function which reads the CSV file data.csv, and automatically creates a DataFrame object df, containing data from the CSV file.
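To try read_csv() without a file on disk, the CSV text can be wrapped in io.StringIO. The three rows below are a hypothetical sample in the same column layout, not the contents of data.csv; usecols and nrows are standard read_csv parameters that limit what is loaded.

```python
import io

import pandas as pd

# Hypothetical sample in the same column layout as the table above
csv_text = """Car,Model,Volume,Weight,CO2
Toyota,Aygo,1000,790,99
Skoda,Citigo,1000,929,95
Fiat,500,900,865,90
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 5)

# Load only two columns and the first two rows
small = pd.read_csv(io.StringIO(csv_text), usecols=["Car", "CO2"], nrows=2)
print(small)
```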


7. Create an Empty DataFrame


In [32]:
import pandas as pd

# create an empty DataFrame
df = pd.DataFrame()

print(df)

Empty DataFrame
Columns: []
Index: []


In this example, we have created an empty DataFrame by calling pd.DataFrame() without any arguments.
Here, both the Columns and Index lists are empty. The DataFrame has no data, but it can be used as a container to store and manipulate data later.
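One way an empty DataFrame might be filled later is by assigning columns to it. This is a minimal sketch with made-up values; building the full dictionary first and creating the DataFrame once is often simpler in practice.

```python
import pandas as pd

df = pd.DataFrame()
print(df.empty)   # True

# Assigning columns later gives the frame its rows and columns
df["name"] = ["abdi", "chala"]
df["age"] = [20, 21]

print(df.shape)   # (2, 2)
print(df)
```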

**_Additional Notes_**
Both Series and DataFrame support a wide range of data types, including numeric, boolean, string (object), categorical, and datetime types.

In summary, pandas’ core data structures—Series (1D) and DataFrame (2D)—enable efficient and flexible handling of labeled data, making them essential tools for data science and analytics in Python.


### **Useful Inspect Methods in Pandas DataFrame**

Inspecting your data is a crucial first step in any data analysis workflow.

Pandas provides several built-in **_methods and attributes_** to help you quickly understand the structure, content, and quality of your DataFrame.

Here are the most useful inspect methods, with examples:

1.  **head() and tail()**

- Purpose: **_View_** the first or last few rows of the DataFrame.

- Usage:


In [None]:
df.head()      # First 5 rows by default
# df.head(10)    # First 10 rows
# df.tail()      # Last 5 rows by default
# df.tail(3)     # Last 3 rows

          Car       Model  Volume  Weight  CO2
0      Toyoty        Aygo    1000     790   99
1  Mitsubishi  Space Star    1200    1160   95
2       Skoda      Citigo    1000     929   95
3        Fiat         500     900     865   90
4        Mini      Cooper    1500    1140  105


- Why: Quickly check data samples, column names, and spot obvious issues.
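The sketch below runs head() and tail() on a ten-row toy DataFrame so the results are easy to verify by eye. The column name n is arbitrary. A negative argument to head() returns everything except the last rows.

```python
import pandas as pd

# Ten rows numbered 0 to 9
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)        # rows 0, 1, 2
last_two = df.tail(2)           # rows 8, 9
all_but_last_two = df.head(-2)  # rows 0 through 7

print(first_three["n"].tolist())
print(last_two["n"].tolist())
print(len(all_but_last_two))
```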

2. **shape Attribute**

- Purpose: Get the dimensions of the DataFrame (rows, columns).

- Usage:


In [None]:
df.shape
# Output: (number_of_rows, number_of_columns)

(36, 5)

In [None]:
import pandas as pd

# Example 1:
data1 = {'col1': [1, 2], 'col2': [3, 4]}
df1 = pd.DataFrame(data1)
print(df1)
print(df1.shape)

# Example 2:
data2 = {'col1': [1, 2, 3, 4, 5]}
df2 = pd.DataFrame(data2)
print(df2)
print(df2.shape)

# Example 3: Empty DataFrame
df3 = pd.DataFrame()
print(df3)
print(df3.shape)

   col1  col2
0     1     3
1     2     4
(2, 2)
   col1
0     1
1     2
2     3
3     4
4     5
(5, 1)
Empty DataFrame
Columns: []
Index: []
(0, 0)


- Why: Know the size of your dataset instantly.


3. **info()**

- Purpose: Summary of the DataFrame, including index dtype, column dtypes, non-null counts, and memory usage.

- Usage:


In [45]:
df.info()
# Display information about the DataFrame, including the number of non-null values and data types of each column.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Car     36 non-null     object
 1   Model   36 non-null     object
 2   Volume  36 non-null     int64 
 3   Weight  36 non-null     int64 
 4   CO2     36 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 1.5+ KB


- Why: Essential for detecting missing data, understanding data types, and getting a quick overview.


4. **describe()**

- Purpose: Generate descriptive statistics for numeric (and optionally, categorical) columns.

- Usage:


In [None]:
df.describe()                      # Numeric columns summary
# df.describe(include='object')      # Categorical columns summary
# df.describe(include='all')         # All columns
# df.describe(include='number')      # Numeric columns summary
# df.describe(include='float')       # Float columns summary

            Volume       Weight         CO2
count    36.000000    36.000000   36.000000
mean   1611.111111  1292.277778  102.027778
std     388.975047   242.123889    7.454571
min     900.000000   790.000000   90.000000
25%    1475.000000  1117.250000   97.750000
50%    1600.000000  1329.000000   99.000000
75%    2000.000000  1418.250000  105.000000
max    2500.000000  1746.000000  120.000000


- Why: Quickly see count, mean, std, min, max, and quartiles for numeric data; unique values, top value, and frequency for categorical data.
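The difference between the numeric and the categorical summary can be seen on a small made-up table. The sketch calls describe() on the text column directly, which avoids depending on how a given pandas version labels string dtypes.

```python
import pandas as pd

# Illustrative data: one text column, one numeric column
df = pd.DataFrame({
    "Car": ["Ford", "Skoda", "Ford", "Audi"],
    "CO2": [98, 95, 99, 99],
})

numeric = df.describe()       # count, mean, std, min, quartiles, max
text = df["Car"].describe()   # count, unique, top, freq

print(numeric)
print(text)
```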


5. **dtypes Attribute**

- Purpose: List the data types of each column.

- Usage:


In [None]:
df.dtypes

Car       object
Model     object
Volume     int64
Weight     int64
CO2        int64
dtype: object

- Why: Check if columns have the expected types (e.g., numeric, object, datetime).


6. **columns, index and values Attributes**

- Purpose: Get the column names, the row index labels, and the underlying data as a NumPy array.

- Usage:


In [None]:
df.columns    # Column names
df.index      # Row index labels
df.values     # Data as a NumPy array

- Why: Useful for referencing or renaming columns and understanding how your data is indexed.
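A short sketch of these attributes on a toy DataFrame, followed by a rename. The labels are invented for illustration; rename() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

print(list(df.columns))   # ['a', 'b']
print(list(df.index))     # ['r1', 'r2']
print(df.values.shape)    # (2, 2)

# Rename one column and one row label
renamed = df.rename(columns={"a": "alpha"}, index={"r1": "first"})
print(list(renamed.columns))  # ['alpha', 'b']
print(list(renamed.index))    # ['first', 'r2']
```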


7. **isnull() and sum()**

- Purpose: Detect missing values in the DataFrame.

- Usage:


In [30]:
df.isnull()            # DataFrame of True/False for missing values
df.isnull().sum()      # Count of missing values per column
df.notnull()           # DataFrame of True/False for non-missing values
df.notnull().sum()     # Count of non-missing values per column

Car       36
Model     36
Volume    36
Weight    36
CO2       36
dtype: int64

- Why: Identify columns with missing data for cleaning or imputation.
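Since the example dataset has no gaps, the sketch below uses a small made-up table with deliberate missing values to show what the counts look like. Counting rows with any gap via any(axis=1) is an extra step added for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in each column
df = pd.DataFrame({
    "Volume": [1000, np.nan, 1500],
    "CO2": [99, 95, np.nan],
    "Car": ["Fiat", None, "Mini"],
})

# Missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
rows_with_gaps = df.isnull().any(axis=1).sum()
print(rows_with_gaps)  # 2
```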

8. **unique(), nunique(), and value_counts()**

- Purpose: Analyze categorical data.

- Usage:


In [None]:
df['column'].unique()         # Array of unique values
df['column'].nunique()        # Number of unique values
df['column'].value_counts()   # Frequency of each value
df['column'].value_counts(normalize=True)  # Relative frequency of each value
df['column'].value_counts().sort_index()  # Sort by index

- Why: Explore the distribution and diversity of categorical columns.
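These three methods can be tried on a short Series of made-up car brands. value_counts() sorts by frequency with the most common value first, and normalize=True turns the counts into shares.

```python
import pandas as pd

cars = pd.Series(["Ford", "Skoda", "Ford", "Audi", "Ford"])

print(cars.unique())    # distinct values in order of first appearance
print(cars.nunique())   # 3

counts = cars.value_counts()                 # most frequent value first
shares = cars.value_counts(normalize=True)   # fractions that sum to 1

print(counts["Ford"])   # 3
print(shares["Ford"])   # 0.6
```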

#### Example Workflow


In [None]:
import pandas as pd

# Load your data
df = pd.read_csv('data.csv')

# Inspect the data
print(df.head())
print(df.tail())
print(df.shape)
df.info()  # info() prints its report itself and returns None
print(df.dtypes)
print(df.columns)
print(df.index)
print(df.describe())
print(df.isnull().sum())
print(df['Car'].unique())  # 'Car' is a text column in data.csv
print(df['Car'].value_counts())

#### Summary Table

| Method/Attribute | Purpose | Example Usage |
| --- | --- | --- |
| head(), tail() | View sample rows | df.head(3) |
| shape | Dimensions (rows, columns) | df.shape |
| info() | DataFrame summary, non-null counts | df.info() |
| dtypes | Data types of columns | df.dtypes |
| columns, index | List columns and row indices | df.columns |
| describe() | Descriptive statistics | df.describe() |
| isnull(), sum() | Detect/count missing values | df.isnull().sum() |
| unique(), nunique(), value_counts() | Analyze categorical data | df['col'].value_counts() |

These methods are essential for efficiently inspecting and understanding your data before analysis or modeling.
