 ### **What is `pip`?**  

`pip` (Package Installer for Python) is the default package manager for Python. It is used to install, upgrade, and manage third-party Python libraries from the **Python Package Index (PyPI)**.  

---

### **Key Features of `pip`**  
- Installs and manages Python libraries.  
- Supports package version control.  
- Allows installing from PyPI, local files, or repositories.  
- Can list installed packages and check for updates.  

---

### **Common `pip` Commands**  

| Command | Description |
|---------|-------------|
| `pip install package_name` | Installs a package |
| `pip install package_name==1.2.3` | Installs a specific version |
| `pip show package_name` | Displays package details |
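
The information that `pip show` prints can also be read from inside Python. The sketch below is a minimal illustration using the standard library's `importlib.metadata`; the helper name `installed_version` and the made-up package name are assumptions for this example, not part of `pip`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the pandas version if it is installed, otherwise None
print(installed_version("pandas"))

# A made-up name that should not exist, to show the "not installed" case
missing = installed_version("surely-not-a-real-package-name-12345")
print(missing)
```

If the function returns `None`, the package can be installed with `pip install package_name`.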


 


In [None]:
# If pandas is not installed, run the following command
!pip install pandas


## Setup

First, we import pandas so that we can use it.

In [4]:
import pandas as pd

 ## Creating a Series 

# `Series` objects
The `pandas` library contains these useful data structures:
* `Series` objects, which we will discuss now. A `Series` object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels).
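
To make the "column name and row labels" idea concrete, here is a small sketch. The prices and labels are invented for illustration only.

```python
import pandas as pd

# A Series with explicit row labels (index) and a column name (name).
# The values are hypothetical sample data.
prices = pd.Series([4000, 5000, 7000], index=["Toyota", "Honda", "BMW"], name="Price")

# Look up a single value by its row label
honda_price = prices["Honda"]

print(prices)
print(honda_price)
```

When no `index` is given, pandas labels the rows 0, 1, 2, and so on.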

In [None]:
# Creating a Series (1-dimensional)
cars = pd.Series(["BMW", "Toyota", "Honda"])

#Printing Series
cars

## Creating a DataFrame

* `DataFrame` objects. A `DataFrame` is a 2D table, similar to a spreadsheet (with column names and row labels).
* `Panel` objects. A `Panel` was a 3D container that could be seen as a dictionary of DataFrames.
It was removed in pandas 1.0, so we will not discuss it here.

In [None]:
colours = pd.Series(["Red" , "Pink" , "Blue"])

# A DataFrame is 2-dimensional
# It can be built from a Python dictionary that maps column names to Series

car_data=pd.DataFrame({"Car make" : cars , "Colours" : colours})

print(car_data)

In [None]:
# Create a dictionary with data
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [20, 21, 19, 22],
    "Marks": [85, 90, 78, 88]
}

# Convert dictionary to DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)


## Importing data in Pandas 

To work with datasets in Pandas, we first need to import the library and
then load the data from different sources like CSV, Excel, JSON, or SQL.

* We can use the `read_csv()` function to load data from a CSV file.
   - `name = pd.read_csv("Path")`
* We can use the `read_excel()` function to load data from an Excel file.
* We can use the `read_json()` function to load data from a JSON file.
* We can use the `read_sql()` function to load data from a SQL database. It takes a query (or table name) and a database connection rather than a file path.
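
The sketch below shows three of these readers on tiny in-memory data, so it runs without any files on disk. The sample rows, the table name `cars`, and the in-memory SQLite database are assumptions made for the example. `read_excel()` is left out because it needs an extra Excel engine to be installed.

```python
import io
import sqlite3

import pandas as pd

# CSV: read_csv accepts a path or any file-like object
csv_text = "Make,Doors\nHonda,4\nBMW,5\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record
json_text = '[{"Make": "Honda", "Doors": 4}, {"Make": "BMW", "Doors": 5}]'
from_json = pd.read_json(io.StringIO(json_text))

# SQL: read_sql needs a query and an open database connection
conn = sqlite3.connect(":memory:")
from_csv.to_sql("cars", conn, index=False)  # create a small table to query
from_sql = pd.read_sql("SELECT * FROM cars WHERE Doors > 4", conn)
conn.close()

print(from_csv)
print(from_json)
print(from_sql)
```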

In [8]:
# Importing a dataset
# Syntax: name = pd.read_csv("Path")
# Use forward slashes (/) in the path rather than backslashes (\), which Python treats as escape characters

df = pd.read_csv("D:/ML SIG/car-sales-extended.csv")

In [None]:
df

## Exploring and Understanding data

* To see the first 5 rows of our dataset, we can use the `.head()` method.
  - To see the first n rows, use `.head(n)`.
* To see the last 5 rows of our dataset, we can use the `.tail()` method.
  - To see the last n rows, use `.tail(n)`.

In [None]:
#prints 5 rows by default
df.head()

In [None]:
df.head(7)  # prints the first 7 rows

In [None]:
#Prints last 5 rows
df.tail()

In [None]:
#prints last n rows
df.tail(7) 

## Checking column names, dtypes, missing values and summary statistics
* **View column names**: Use `df.columns` to list all column names.
* **Check data types**: Use `df.dtypes` to identify numeric and categorical columns.
* **Detect missing values**: Use `df.isnull().sum()` to count missing values in each column.
* **Get summary statistics**: `df.describe()` reports the count, mean, standard deviation, min, quartiles (the 50% row is the median), and max for numerical columns.


In [None]:
#To see all column names of a dataframe
df.columns

In [None]:
# To check the data type of each column
df.dtypes

In [None]:
df.isnull().sum()

In [None]:
# to get summary statistics of numerical data
df.describe()

## Indexing & Selecting Data

## How to select columns?
* To select a single column, use:
   - `df["col_name"]`
* To select multiple columns, use:
   - `df[["col1", "col2"]]`

In [None]:
# select a single column
df["Doors"]

In [None]:
df[["Doors", "Price"]]

## Selecting rows 

You can use slicing (start:end:step) directly on a DataFrame to extract specific rows.
* start: Starting row index (inclusive)
* end: Ending row index (exclusive)
* step: (Optional) Step size
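
A small sketch of `start:end:step` on a made-up DataFrame, to show which rows each slice returns:

```python
import pandas as pd

# Hypothetical data: six rows with the default index 0..5
numbers = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})

# Rows at positions 1, 2 and 3 (the end position 4 is excluded)
middle = numbers[1:4]

# Every second row, starting from the first one
every_other = numbers[::2]

print(middle)
print(every_other)
```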

In [None]:
#selecting rows
df[7:15]

## Using loc and iloc

### **Selecting Data in Pandas: `.loc[]` vs `.iloc[]`**  

Pandas provides `.loc[]` and `.iloc[]` for selecting specific rows and columns from a DataFrame.  

---

### **1. `.loc[]` (Label-Based Selection)**  
- Selects data using **row and column labels**.  
- Supports selecting **single values, multiple values, slices, and conditions**.  
- Slicing is **inclusive** (includes both start and end labels).  
- Can be used with **boolean conditions** for filtering data.  

---

### **2. `.iloc[]` (Position-Based Selection)**  
- Selects data using **integer index positions**.  
- Supports selecting **single values, multiple values, and slices**.  
- Slicing is **exclusive** (does not include the endpoint).  
- Does not accept column names or a boolean Series (a plain boolean list or array works, but `.loc[]` is the usual choice for filtering).  

---

### **Key Differences Between `.loc[]` and `.iloc[]`**  

| Feature | `.loc[]` (Label-Based) | `.iloc[]` (Position-Based) |
|---------|----------------|----------------|
| Selection Type | Uses row/column labels | Uses integer positions |
| Slicing | Inclusive (includes endpoint) | Exclusive (excludes endpoint) |
| Supports Boolean Filtering | Yes | Only with a plain boolean list or array, not a boolean Series |
| Works with Column Names | Yes | No |

Understanding `.loc[]` and `.iloc[]` helps in efficient data selection, filtering, and manipulation in Pandas.  
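
The differences summarised above can be checked on a small example. The DataFrame below is made up for illustration, and it uses letter labels so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical data with string row labels
cars = pd.DataFrame(
    {"Make": ["Honda", "BMW", "Toyota", "Nissan"], "Doors": [4, 5, 4, 3]},
    index=["a", "b", "c", "d"],
)

# .loc slices by label and includes the end label "c"
by_label = cars.loc["a":"c", "Make"]

# .iloc slices by position and excludes the end position 2
by_position = cars.iloc[0:2, 0]

# .loc also accepts a boolean condition together with column names
four_doors = cars.loc[cars["Doors"] == 4, ["Make"]]

print(by_label)
print(by_position)
print(four_doors)
```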


In [None]:
animals = pd.Series(["cat", "Dog", "bird", "Panda", "snake"], index=[0, 3, 9, 8, 3])
animals

`df.loc[10]` selects the row whose index label is 10 (Nissan, White, 167421.0 KM).


In [None]:
df.head(20)

In [None]:
df.loc[10]


Selecting Multiple Rows and Specific Columns using .loc[]

In [None]:
# Selects 'Make', 'Colour', and 'Price' for index 7 (Honda) and 9 (Honda)
df.loc[[7, 9], ["Make", "Colour", "Price"]]  

### Selecting Data using .iloc[] (Position-Based)

Selecting a Range of Rows and Columns using .iloc[]

In [None]:
animals


In [None]:
#iloc -> refers to the position of the object
animals.iloc[3]

In [None]:
# A slice takes a start position, a stop position and a step
# The stop position itself is not included, so this returns positions -1, -2 and -3
animals.iloc[-1:-4:-1]

In [None]:
df.iloc[2]  # Selects the 3rd row (position 2), whatever its index label is



In [None]:
df.iloc[1:5, 1:4]  # Selects rows at positions 1 to 4 (5 is excluded) and columns at positions 1 to 3 (Colour, Odometer (KM), Doors)


## Data Cleaning & Preprocessing

### Handling Missing Values
* Fill missing values with `fillna()`
* Drop rows that contain missing values with `dropna()`
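
The two options can be compared on a tiny example. This is a sketch on invented data; the column names mirror the dataset used in this notebook, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
readings = pd.DataFrame(
    {"Odometer (KM)": [1000.0, np.nan, 3000.0], "Doors": [4, 4, np.nan]}
)

# Fill only the odometer column with its mean; "Doors" keeps its NaN
filled = readings.fillna({"Odometer (KM)": readings["Odometer (KM)"].mean()})

# Drop every row that has at least one missing value
dropped = readings.dropna()

print(filled)
print(dropped)
```

Both calls return new DataFrames, so `readings` itself still contains its missing values afterwards.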

In [None]:
# fillna() returns a new Series; without assignment, df itself is not changed
df["Odometer (KM)"].fillna(df["Odometer (KM)"].mean())

In [None]:
df.isnull().sum()

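### **Handling Missing Values in Pandas: `fillna()` with `inplace=True` vs Assignment**  

In Pandas, missing values (`NaN`) can be replaced using `.fillna()`. There are two ways to do this:  

1. **Method 1: Using `inplace=True`** (modifies the object in place and returns `None`)  
   ```python
   df["Odometer (KM)"].fillna(df["Odometer (KM)"].mean(), inplace=True)
   ```
   Newer pandas versions warn about this chained pattern, because `df["Odometer (KM)"]` can be a temporary copy. With Copy-on-Write enabled, `df` is not updated at all.
2. **Method 2: Using the assignment operator** (`fillna()` returns a new column, which is assigned back to `df`)  
   ```python
   df["Odometer (KM)"] = df["Odometer (KM)"].fillna(df["Odometer (KM)"].mean())
   ```
   Do not combine assignment with `inplace=True`: in that case `fillna()` returns `None`, and the whole column would be overwritten with `None`.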

In [None]:
# Assign the filled column back so that df is actually updated
df["Odometer (KM)"] = df["Odometer (KM)"].fillna(df["Odometer (KM)"].mean())

In [None]:
df.isnull().sum()

In [149]:
#df.dropna(inplace=True)

### **Handling Duplicate Values in Pandas**  

Duplicate values can cause data inconsistencies and affect analysis accuracy. Pandas provides methods to identify, remove, or manage duplicate entries in a DataFrame.  

- **`df.drop_duplicates()`**: Removes duplicate rows while keeping the first occurrence.  
- **`df.drop_duplicates(subset=["column_name"])`**: Drops duplicates based on specific columns.  
- **`inplace=True`**: Modifies the DataFrame without returning a new one.

![duplicateFIN.png](attachment:3d649cf6-ed65-422a-a851-b582db27cb7f.png)
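
A minimal sketch of these methods on invented data, where the third row repeats the first one:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0
sales = pd.DataFrame(
    {
        "Make": ["Honda", "BMW", "Honda", "BMW"],
        "Colour": ["White", "Blue", "White", "Red"],
    }
)

# Count rows that repeat an earlier row in every column
n_duplicates = int(sales.duplicated().sum())

# Remove full-row duplicates, keeping the first occurrence
unique_rows = sales.drop_duplicates()

# Treat rows as duplicates when only "Make" matches
unique_makes = sales.drop_duplicates(subset=["Make"])

print(n_duplicates)
print(unique_rows)
print(unique_makes)
```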

In [None]:
df.duplicated().sum()

In [None]:
#df.drop_duplicates(inplace=True)

In [None]:
df.dtypes["Price"]

In [None]:
# astype() converts a column to another data type (here: Price to strings)
df["Price"] = df["Price"].astype(str)
df.dtypes["Price"]

### **Different Data Types (`dtypes`) in a Pandas Dataset**  

A Pandas DataFrame can contain multiple data types. The `.dtypes` attribute shows the type of each column.  

#### **Common Data Types in Pandas**  

| Data Type | Description | Example |
|-----------|------------|---------|
| `int64`   | Integer values | `1, 100, -50` |
| `float64` | Decimal (floating-point) values | `1.5, 100.25, -0.99` |
| `object`  | Text or mixed data | `"Apple"`, `"Hello123"` |
| `bool`    | Boolean values (`True` or `False`) | `True, False` |
| `datetime64` | Date and time values | `2025-03-31 12:00:00` |
| `timedelta64` | Differences between dates/times | `5 days, 2 hours` |
| `category` | Categorical data for optimization | `"Red"`, `"Blue"`, `"Green"` |
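
Several of the types in the table can be produced by converting columns explicitly. The sketch below uses made-up records; `astype()` and `pd.to_datetime()` are the conversion calls.

```python
import pandas as pd

# Hypothetical records where every column except Doors starts out as text
records = pd.DataFrame(
    {
        "Doors": [4, 5, 4],
        "Price": ["4000", "5500", "7000"],
        "Colour": ["Red", "Blue", "Red"],
        "Sold": ["2025-03-31", "2025-04-01", "2025-04-02"],
    }
)

# Text digits -> integers
records["Price"] = records["Price"].astype("int64")

# Repeated labels -> category (stores each distinct label once)
records["Colour"] = records["Colour"].astype("category")

# Date strings -> datetime64
records["Sold"] = pd.to_datetime(records["Sold"])

print(records.dtypes)
```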

#### **Example: Checking Data Types in a DataFrame**
```python
df.dtypes
```



## Filtering & Conditional Selection

### Why?
* Extract relevant data based on conditions
* Helps in targeted analysis
* Enables better decision-making

### To filter the data, we can:
* Use comparison operators: `>`, `<`, `==`, `!=`, etc.
* Use logical operators: `&` (AND), `|` (OR), `~` (NOT).
* Apply multiple conditions to refine data selection, wrapping each condition in parentheses.
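
The `|` and `~` operators work the same way as `&`. Here is a short sketch on invented data; the column names follow the dataset used in this notebook, but the rows are made up.

```python
import pandas as pd

# Hypothetical sample rows
cars = pd.DataFrame(
    {
        "Make": ["Honda", "BMW", "Toyota", "Nissan"],
        "Odometer (KM)": [35000, 192000, 120000, 154000],
        "Doors": [4, 5, 4, 3],
    }
)

# OR: high mileage or five doors (each condition needs its own parentheses)
high_or_five = cars[(cars["Odometer (KM)"] > 100000) | (cars["Doors"] == 5)]

# NOT: every car that does not have four doors
not_four = cars[~(cars["Doors"] == 4)]

print(high_or_five)
print(not_four)
```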

In [None]:
# Filtering the data using a comparison operator
df[df["Odometer (KM)"] > 100000]

In [None]:
#Filtering data using logical operator
df[(df["Odometer (KM)"] > 100000) & (df["Doors"] == 4)]

## Applying Functions to Data

In [None]:
#Applying function to Data
df["Doors"].head()

1. Using `.apply()` for custom transformations
    - It lets you apply a function to every value of a column (or to every row or column of a DataFrame).
2. Using a `lambda` function for quick transformations
    - It is a quicker way to write a simple function inline.
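
Both styles are sketched below on a made-up Series. The function name `describe_doors` and its "family"/"compact" labels are invented for the example.

```python
import pandas as pd

doors = pd.Series([4, 5, 3])


def describe_doors(n):
    """Hypothetical rule: label cars with four or more doors as 'family'."""
    return "family" if n >= 4 else "compact"


# .apply() with a named function
labels = doors.apply(describe_doors)

# .apply() with a lambda for a one-line transformation
plus_one = doors.apply(lambda x: x + 1)

# For simple arithmetic, operating on the whole column gives the same result
vectorized = doors + 1

print(labels)
print(plus_one)
```

For plain arithmetic like this, the whole-column form is usually preferable on large data; `.apply()` is most useful when the logic does not map onto a built-in operation.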

In [None]:
df["Doors"] = df["Doors"].apply(lambda x: x+1)
df["Doors"].head()

In [None]:
#sorting in ascending order
df.sort_values("Doors")

## Sorting & Ordering Data

### **Sorting Data in Pandas**  

Sorting helps organize data for better analysis. Pandas provides:  

**1. Sorting by Column (`sort_values()`)**  
- `by="column_name"` → Specifies the column to sort.  
- `ascending=True/False` → Controls order (default: `True`).  
- `inplace=True/False` → Modifies DataFrame if `True`.  
- Can sort multiple columns with different orders.  
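
Sorting by several columns with different orders can be sketched as follows. The rows are invented sample data.

```python
import pandas as pd

# Hypothetical sample rows
cars = pd.DataFrame(
    {
        "Make": ["Honda", "BMW", "Toyota", "Nissan"],
        "Doors": [4, 5, 4, 3],
        "Price": [15000, 20000, 13000, 14000],
    }
)

# Doors ascending first; within equal Doors, Price descending
ordered = cars.sort_values(by=["Doors", "Price"], ascending=[True, False])

print(ordered)
```

`sort_values()` returns a new DataFrame by default, so `cars` keeps its original row order.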


In [None]:
#sorting in descending order
df.sort_values("Doors", ascending = False)

# What next?
As you probably noticed by now, pandas is quite a large library with *many* features.
Although we went through the most important features, there is still a lot to discover.
Probably the best way to learn more is to get your hands dirty with some real-life data.
It is also a good idea to go through pandas'
excellent [documentation](http://pandas.pydata.org/pandas-docs/stable/index.html), in particular the [Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html).

## Thank You !! 