# Python for Data Science

Python is used in data science for data preprocessing, feature engineering,
automation, and building data pipelines.



## Python Revision: Data Science Focus

This section covers Python concepts revised and practiced with a focus on
data analysis and real-world problem solving.

### Core Python
#### Variables and Data Types (`int`, `float`, `str`, `bool`)
- Use case: Storing numerical metrics, categorical values, and flags in datasets.
#### Type Conversion and Basic Input / Output
- Use case: Converting raw data (strings) into numeric formats for analysis.

### Data Structures
#### Lists, Tuples, Sets
- Use case: Storing rows, unique values, and temporary collections during preprocessing.
#### Dictionaries
- Use case: Frequency counting, mapping categories, JSON-like data handling.
#### Common operations on collections
- Use case: Filtering, transforming, and aggregating dataset values.

### Control Flow
#### Conditional Statements (`if`, `elif`, `else`)
- Use case: Data validation and rule-based transformations.
#### Loops (`for`, `while`)
- Use case: Iterating over records and applying transformations.
#### Loop control (`break`, `continue`)
- Use case: Skipping corrupted records during data cleaning.
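
As a minimal sketch of loop control during cleaning, the loop below skips a corrupted record with `continue` and stops at an end marker with `break`. The `records` list and the `"END"` sentinel are made up for illustration.

```python
# Hypothetical raw records: None marks a corrupted entry,
# "END" marks the end of the valid data
records = [12, None, 7, 3, "END", 99]

clean = []
for record in records:
    if record is None:
        continue  # skip the corrupted record and move to the next one
    if record == "END":
        break     # stop processing once the sentinel is reached
    clean.append(record)

print("Clean records:", clean)
```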

### Functions
#### User-defined functions
- Use case: Reusable data-cleaning and transformation logic.
#### Return values
- Use case: Returning processed datasets or computed metrics.
#### Function reuse and modular code
- Use case: Building maintainable data pipelines.

### Pythonic Coding
#### List Comprehensions
- Use case: Fast filtering and transformation of data.
#### Dictionary & Set Comprehensions
- Use case: Feature encoding and unique value extraction.
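
The following sketch shows one way comprehensions could be used for unique value extraction and a simple integer encoding. The `cities` column is a hypothetical example, and the integer codes are only one of several possible encoding schemes.

```python
# Hypothetical categorical column with repeated values
cities = ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"]

# Set comprehension: unique values, normalized to lowercase
unique_cities = {city.lower() for city in cities}

# Dictionary comprehension: map each category to an integer code.
# sorted() makes the code assignment deterministic.
city_codes = {city: code for code, city in enumerate(sorted(unique_cities))}

# List comprehension: encode the original column
encoded = [city_codes[city.lower()] for city in cities]

print("Codes:", city_codes)
print("Encoded:", encoded)
```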

### Functional Programming
#### `lambda` functions
- Use case: Inline filtering and transformation logic.
#### `map()`, `filter()`
- Use case: Applying transformations across datasets.

### Error Handling
#### Exception handling using `try`, `except`, `finally`
- Use case: Handling missing or corrupted data safely.
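
A small sketch of the three keywords working together is shown below. The `parse_age` helper and the `attempts` log are hypothetical names used only for this example.

```python
attempts = []  # records every value we tried to parse

def parse_age(raw):
    """Return the age as an int, or None if the value cannot be parsed."""
    try:
        return int(raw)
    except (ValueError, TypeError):
        # ValueError: text that is not a number; TypeError: None
        return None
    finally:
        # Runs whether or not an exception occurred
        attempts.append(raw)

ages = [parse_age(value) for value in ["34", "n/a", None, "51"]]
print("Parsed ages:", ages)
print("Values attempted:", len(attempts))
```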

### File Handling (Basics)
#### Reading from files
- Use case: Loading CSV and text datasets.
#### Writing to files
- Use case: Saving cleaned or transformed data.
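
As an illustration, the sketch below writes a few rows with the standard-library `csv` module and reads them back. The rows and the file name `cleaned.csv` are made up, and a temporary directory is used so no real data is touched.

```python
import csv
import os
import tempfile

rows = [["name", "age"], ["A", "25"], ["B", "30"]]

# Write to a temporary location (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "cleaned.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back; the csv module returns every value as a string
with open(path, newline="") as f:
    loaded = list(csv.reader(f))

print(loaded)
```

Using `with` ensures the file is closed even if an error occurs while reading or writing.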

### Modules and Imports
#### Modules and Packages
- Use case: Organizing reusable Python code for data cleaning, feature engineering, and utilities.

#### Importing Modules (`import`, `from … import`)
- Use case: Using libraries like `numpy`, `pandas`, `matplotlib`, and custom utility scripts.

#### Standard Libraries (`os`, `math`, `datetime`)
- Use case: File handling, mathematical operations, and date/time feature creation.

#### Third-Party Libraries
- Use case: Leveraging data science libraries such as `pandas` for data analysis and `numpy` for numerical computing.

### Object-Oriented Programming (Basics)
#### Classes
- Use case: Creating structured data models (e.g., `Dataset`, `DataCleaner`).

#### Objects
- Use case: Representing real datasets or processing instances.

#### Methods
- Use case: Encapsulating data processing steps like cleaning, transformation, and validation.

#### `__init__` Constructor
- Use case: Initializing datasets, parameters, and configurations.

#### Instance Variables
- Use case: Storing dataset-specific values such as columns, thresholds, or metadata.

#### Basic Encapsulation
- Use case: Keeping data processing logic organized and reusable.
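
To make these ideas concrete, here is a minimal sketch of a `DataCleaner` class. The class body, its `min_value`/`max_value` thresholds, and the method names are assumptions chosen for illustration, not a prescribed design.

```python
class DataCleaner:
    """Hypothetical cleaner that drops missing and out-of-range values."""

    def __init__(self, min_value, max_value):
        # Instance variables hold this cleaner's configuration
        self.min_value = min_value
        self.max_value = max_value

    def is_valid(self, value):
        """Check a single value against the configured range."""
        return value is not None and self.min_value <= value <= self.max_value

    def clean(self, values):
        """Return only the valid values, preserving their order."""
        return [v for v in values if self.is_valid(v)]


# Create an object (instance) configured for age data
age_cleaner = DataCleaner(min_value=0, max_value=120)
cleaned_ages = age_cleaner.clean([25, None, -3, 40, 250])
print("Cleaned ages:", cleaned_ages)
```

Keeping the thresholds on the instance means the same class can be reused for other columns by creating another object with different limits.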


Goal: Build strong Python fundamentals required for Data Scientist roles through hands-on practice.




## Python Revision – Core Exercises
This section contains beginner-friendly Python exercises focused on data cleaning,
transformation, and file handling for Data Science.

### 1) Remove Invalid Values from a List
#### Problem:
Remove invalid values (None, empty string, and non-numeric values) from a list.


In [None]:
# Input list with invalid values
data = [10, None, 25, "", 40, "abc", 60]

# Remove invalid values
cleaned_data = [x for x in data if isinstance(x, (int, float))]

print("Cleaned Data:", cleaned_data)

Use case: Data cleaning before analysis.

### 2) Convert String Prices to Float Safely
#### Problem:
Convert price values stored as strings into floats, ignoring invalid values.


In [None]:
prices = ["100.5", "200", "abc", "350.75", "", None]

converted_prices = []

for price in prices:
    try:
        converted_prices.append(float(price))
    except (ValueError, TypeError):
        pass

print("Converted Prices:", converted_prices)

Use case: Cleaning real-world price or salary data.

### 3) Count Frequency Using Dictionary
#### Problem:
Count how many times each item appears in a list.


In [None]:
items = ["apple", "banana", "apple", "orange", "banana", "apple"]

frequency = {}

for item in items:
    frequency[item] = frequency.get(item, 0) + 1

print("Item Frequency:", frequency)

Use case: Analyzing categorical data.
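
As an alternative sketch, the standard library's `collections.Counter` does the same counting and can also rank items by frequency. It is shown here only as an option alongside the dictionary approach above.

```python
from collections import Counter

items = ["apple", "banana", "apple", "orange", "banana", "apple"]

# Counter builds the frequency mapping in one step
frequency = Counter(items)

# most_common(n) returns the n most frequent items with their counts
top_two = frequency.most_common(2)
print("Top two items:", top_two)
```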

### 4) Filter Values Using Lambda Function
#### Problem:
Filter values greater than 50 from a list.

In [None]:
numbers = [10, 45, 60, 30, 90, 25, 80]

filtered_numbers = list(filter(lambda x: x > 50, numbers))

print("Filtered Numbers:", filtered_numbers)

Use case: Feature filtering and threshold-based selection.

### 5) Read CSV File & Print First 5 Rows
#### Problem:
Read a CSV file and display the first 5 rows.

In [None]:
import pandas as pd

# Read CSV file
df = pd.read_csv("data.csv")

# Display first 5 rows
print(df.head())

Use case: Initial data exploration.

# NumPy for Data Science

NumPy is the core numerical library used in data science for
fast computations, vectorized operations, and statistical analysis.


### NumPy Arrays

NumPy arrays store numerical data efficiently and support
fast mathematical operations compared to Python lists.

- Use case: Storing numerical features such as age,
salary, and transaction amounts before applying ML algorithms.



In [None]:
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
arr




### Shape and Data Type

Understanding array shape and data type is essential
to ensure compatibility with machine learning models.

- Use case: Validating input dimensions before
training models like Linear Regression or Neural Networks.



In [None]:
arr.shape, arr.dtype




### Indexing and Slicing

Indexing and slicing allow selecting subsets of data,
which is commonly used during data cleaning.

- Use case: Selecting specific feature columns
or filtering records during preprocessing.


In [None]:
arr[1:4]


### Statistical Operations

Statistical functions help summarize and understand
data distribution during EDA.

- Use case: Detecting skewness, spread, and anomalies
during Exploratory Data Analysis (EDA).

In [None]:
# ndarray has no .median() method, so the median comes from np.median()
arr.mean(), np.median(arr), arr.std(), arr.sum()


### Boolean Masking

Boolean masking filters data using conditions,
often used to remove noise or outliers.

- Use case: Removing invalid values such as
negative prices or unrealistic ages.


In [None]:
arr[arr > 25]


### Vectorized Operations

Vectorized operations apply calculations on entire arrays
without loops, improving performance.

- Use case: Scaling features efficiently
before feeding data into machine learning models.


In [None]:
arr + 10
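
For the feature-scaling use case, a minimal sketch of min-max scaling is shown below. Min-max scaling is just one common choice of scaling method, picked here for illustration.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
# The arithmetic is applied to every element at once, with no explicit loop.
scaled = (arr - arr.min()) / (arr.max() - arr.min())

print(scaled)
```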


### Conditional Logic using np.where()

`np.where()` applies conditional transformations,
useful in feature engineering.

- Use case: Creating new features such as
flagging high-value customers or risky transactions.


In [None]:
np.where(arr > 30, arr, 0)


### Handling Invalid Values

NumPy helps replace invalid values before
loading data into Pandas.

- Use case: Cleaning datasets by converting
invalid entries to NaN for proper imputation.


In [None]:
arr2 = np.array([10, -1, 20, -1, 30])
np.where(arr2 < 0, np.nan, arr2)
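
Once invalid entries are NaN, they can be imputed. The sketch below fills them with the mean of the valid values. Mean imputation is an assumption made for this example, and other strategies may suit a given dataset better.

```python
import numpy as np

arr2 = np.array([10, -1, 20, -1, 30])

# Replace the invalid marker (-1) with NaN; the result is a float array
with_nan = np.where(arr2 < 0, np.nan, arr2)

# np.nanmean ignores NaN entries when computing the average
fill_value = np.nanmean(with_nan)

# Replace each NaN with the mean of the valid values
imputed = np.where(np.isnan(with_nan), fill_value, with_nan)

print(imputed)
```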


# Pandas for Data Science

Pandas is used for data cleaning, transformation,
exploratory data analysis (EDA), and feature engineering.


### 1. Series and DataFrame

A Series represents a single column;
a DataFrame represents tabular data made up of multiple columns.

- Use case: Creating structured datasets for analysis.


In [None]:
import pandas as pd

s = pd.Series([10, 20, 30])
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'age': [25, 30, 35]
})

s, df


### 2. Reading Data

Pandas supports loading data from CSV, Excel, SQL, etc.

- Use case: Loading real-world datasets.


In [None]:
# Example CSV loading
# df = pd.read_csv('data.csv')

df.head()


### 3. Inspecting Data

Inspecting a dataset reveals its structure and quality.

- Use case: Detecting missing values and checking data types.


In [None]:
df.info()
df.describe()


### 4. Indexing and Filtering

Selecting rows and columns for analysis.

- Use case: Filtering customer segments.


In [None]:
df.loc[df['age'] > 25]
df.iloc[0:2]
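
Filters can also combine several conditions. The sketch below rebuilds the small `df` from earlier and selects rows and columns in one `loc` call. The particular conditions are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'age': [25, 30, 35]
})

# Each condition needs its own parentheses.
# Use & for "and", | for "or", and ~ for "not".
segment = df.loc[(df['age'] > 25) & (df['name'] != 'C'), ['name', 'age']]

print(segment)
```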


### 5. Handling Missing Values

Missing data is common in real datasets and must be handled.

- Use case: Cleaning incomplete datasets.


In [None]:
df2 = pd.DataFrame({
    'age': [25, None, 30],
    'salary': [50000, 60000, None]
})

df2.isna()
df2.fillna(df2.mean())


### 6. Sorting Data

Sorting helps in ranking and prioritization.

- Use case: Ranking top customers or products.


In [None]:
df.sort_values(by='age', ascending=False)


### 7. GroupBy and Aggregation

Used for summarizing data by categories.

- Use case: Producing business-level insights, such as total sales per product.




In [None]:
sales_df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C'],
    'sales': [100, 200, 150, 300]
})

sales_df.groupby('product')['sales'].sum()
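
Several aggregations can be computed in a single pass with `agg`. The sketch below reuses the same sample sales data. The choice of `sum`, `mean`, and `count` is only an example.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C'],
    'sales': [100, 200, 150, 300]
})

# One row per product, one column per aggregation
summary = sales_df.groupby('product')['sales'].agg(['sum', 'mean', 'count'])

print(summary)
```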


### 8. Apply and Lambda

Used for custom row-wise or column-wise operations.

- Use case: Feature engineering.


In [None]:
sales_df['sales_level'] = sales_df['sales'].apply(
    lambda x: 'High' if x > 150 else 'Low'
)

sales_df


### 9. Merge and Concat

Combining multiple datasets.

- Use case: Joining customer and transaction data.


In [None]:
df_left = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B']})
df_right = pd.DataFrame({'id': [1, 2], 'salary': [50000, 60000]})

pd.merge(df_left, df_right, on='id')
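
The example above covers `merge`. As a complementary sketch, the code below shows a left join, where some rows have no match, and `pd.concat` for stacking rows. The extra rows (`id` 3 and 4) are made up for illustration.

```python
import pandas as pd

df_left = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
df_right = pd.DataFrame({'id': [1, 2], 'salary': [50000, 60000]})

# Left join: keep every row of df_left; unmatched rows get NaN for salary
merged = pd.merge(df_left, df_right, on='id', how='left')

# concat stacks frames with the same columns on top of each other;
# ignore_index=True renumbers the rows from 0
more_rows = pd.DataFrame({'id': [4], 'name': ['D']})
stacked = pd.concat([df_left, more_rows], ignore_index=True)

print(merged)
print(stacked)
```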


### 10. DateTime Handling

Dates are important for time-based analysis.

- Use case: Trend and time-series analysis.

In [None]:
df_date = pd.DataFrame({
    'date': ['2024-01-01', '2024-02-01']
})

df_date['date'] = pd.to_datetime(df_date['date'])
df_date
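
After conversion, the `.dt` accessor exposes date components that can serve as features. The sketch below derives two such columns. The column names `month` and `day_name` are chosen for this example.

```python
import pandas as pd

df_date = pd.DataFrame({
    'date': ['2024-01-01', '2024-02-01']
})
df_date['date'] = pd.to_datetime(df_date['date'])

# Extract date parts as new feature columns
df_date['month'] = df_date['date'].dt.month
df_date['day_name'] = df_date['date'].dt.day_name()

print(df_date)
```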


# Data Cleaning Pipeline using Pandas

Data cleaning is a critical step in data science.
A commonly cited estimate is that around 80% of a data scientist’s time goes to
cleaning and preparing data before analysis or modeling.


## Sample Dataset

In [None]:
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'A', None],
    'age': [25, None, 30, 25, 40],
    'salary': ['50000', '60000', None, '50000', '70000']
})

df


### Step 1: Understanding the Data

Before cleaning, it is important to inspect:
- Missing values
- Data types
- Duplicate records


In [None]:
df.info()
df.isna().sum()


### Step 2: Handling Missing Values

- Age column has missing values → filled using mean
- Salary missing values handled later after type conversion


In [None]:
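# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about
df['age'] = df['age'].fillna(df['age'].mean())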
df


### Step 3: Fixing Data Types

Salary is stored as string.
It must be converted to numeric for analysis.


In [None]:
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
df


### Step 4: Handling Remaining Missing Values

With salary converted to numeric, its missing values can now be filled using the median.


In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['salary'] = df['salary'].fillna(df['salary'].median())
df


### Step 5: Removing Duplicates

Duplicate records can bias analysis and must be removed.


In [None]:
df.drop_duplicates(inplace=True)
df


### Step 6: Feature Engineering

Creating new features helps extract insights from data.


In [None]:
df['salary_level'] = df['salary'].apply(
    lambda x: 'High' if x > 55000 else 'Low'
)

df


## Cleaning Pipeline Summary

1. Inspected missing values and data types
2. Handled missing numerical values using mean and median
3. Converted salary column to numeric
4. Removed duplicate records
5. Created a new feature for salary classification

This cleaned dataset is now ready for analysis or machine learning.
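
The steps summarized above could be collected into a single reusable function. The sketch below is one possible arrangement. The function name `clean_salary_data` is hypothetical, and it assumes the same `name`, `age`, and `salary` columns as the sample dataset.

```python
import pandas as pd


def clean_salary_data(raw):
    """Apply the cleaning steps to a copy of raw and return the result."""
    out = raw.copy()  # leave the caller's DataFrame unchanged

    # Fill missing ages with the mean age
    out['age'] = out['age'].fillna(out['age'].mean())

    # Convert salary strings to numbers, then fill gaps with the median
    out['salary'] = pd.to_numeric(out['salary'], errors='coerce')
    out['salary'] = out['salary'].fillna(out['salary'].median())

    # Remove duplicate rows and renumber the index
    out = out.drop_duplicates().reset_index(drop=True)

    # Add the salary classification feature
    out['salary_level'] = out['salary'].apply(
        lambda x: 'High' if x > 55000 else 'Low'
    )
    return out


raw = pd.DataFrame({
    'name': ['A', 'B', 'C', 'A', None],
    'age': [25, None, 30, 25, 40],
    'salary': ['50000', '60000', None, '50000', '70000']
})

cleaned = clean_salary_data(raw)
print(cleaned)
```

Note that the missing `name` in the last row is left as-is, matching the steps above, which only handle the numeric columns.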
