
# **1. THEORY**
### NumPy

**NumPy Arrays:** *Homogeneous data structures used for fast numerical computations.*

**Array Operations:** *Mathematical operations performed element-wise on arrays.*

**Broadcasting:** *A mechanism that lets NumPy apply element-wise operations to arrays of different but compatible shapes by virtually stretching the smaller array to match the larger one.*
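
A minimal sketch of broadcasting, using made-up values rather than the datasets in this notebook: a column of shape (3, 1) and a row of shape (4,) combine into a (3, 4) result, and a scalar is applied to every element.

```python
import numpy as np

# Illustrative values only; not taken from the datasets used later.
column = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([10, 20, 30, 40])     # shape (4,)

# NumPy virtually stretches both operands to shape (3, 4) before adding.
grid = column + row
print(grid.shape)   # (3, 4)

# A scalar broadcasts against every element of the array.
doubled = row * 2
print(doubled)      # [20 40 60 80]
```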

### Pandas

**Series:** *One-dimensional labeled data structure.*

**DataFrame:** *Two-dimensional labeled tabular data structure with rows and columns.*

**Indexing:** *Selecting specific rows or columns from data.*

**Data Grouping:** *Splitting data into groups and applying aggregation functions like mean, sum, etc.*
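
The four Pandas terms above can be illustrated together in a few lines. The city names and numbers below are hypothetical toy data chosen only for the example.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the terms above.
ages = pd.Series([30, 45, 60], index=["a", "b", "c"])   # Series: 1-D, labeled

df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],
    "sales": [100, 250, 300],
})                                                       # DataFrame: 2-D table

print(ages["b"])                  # indexing a Series by label
print(df[df["sales"] > 150])      # indexing rows with a boolean condition

# Grouping: split rows by city, then sum the sales within each group.
by_city = df.groupby("city")["sales"].sum()
print(by_city)
```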

# **Hands-On: NumPy arrays, operations, broadcasting**

In [1]:
import numpy as np

# Sales quantities
quantities = np.array([2, 1, 0, 1, 2, 3, 1])

In [2]:
# Prices
prices = np.array([55000, 20000, 15000, 55000, 0, 2000, 15000])

In [3]:
# Total sales per order: element-wise multiplication
# (both arrays have the same shape, so no broadcasting is needed here)
total_sales = quantities * prices

print("Quantities:", quantities)
print("Prices:", prices)
print("Total Sales:", total_sales)

Quantities: [2 1 0 1 2 3 1]
Prices: [55000 20000 15000 55000     0  2000 15000]
Total Sales: [110000  20000      0  55000      0   6000  15000]


In [4]:
# Average sales value
average_sales = np.mean(total_sales)
print("Average Sales Value:", average_sales)

Average Sales Value: 29428.571428571428


# **Manipulating Datasets with Pandas**

*Dataset manipulation in Pandas means loading data, cleaning it, selecting required rows/columns, modifying values, and performing analysis using DataFrames.*

## 1. Load a Dataset

In [7]:
import pandas as pd

df = pd.read_csv("/sales_data.csv.txt")
print(df)

   order_id     product  quantity    price
0         1      Laptop       2.0  55000.0
1         2      Mobile       1.0  20000.0
2         3      Tablet       NaN  15000.0
3         4      Laptop       1.0  55000.0
4         5      Mobile       2.0      NaN
5         6  Headphones       3.0   2000.0
6         7      Tablet       1.0  15000.0


# 2. View and Understand the Data

In [9]:
# First 5 rows
print(df.head())

   order_id product  quantity    price
0         1  Laptop       2.0  55000.0
1         2  Mobile       1.0  20000.0
2         3  Tablet       NaN  15000.0
3         4  Laptop       1.0  55000.0
4         5  Mobile       2.0      NaN


In [12]:
# Structure and data types
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   order_id  7 non-null      int64  
 1   product   7 non-null      object 
 2   quantity  6 non-null      float64
 3   price     6 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 356.0+ bytes
None


In [11]:
# Statistical summary
print(df.describe())

       order_id  quantity         price
count  7.000000  6.000000      6.000000
mean   4.000000  1.666667  27000.000000
std    2.160247  0.816497  22494.443758
min    1.000000  1.000000   2000.000000
25%    2.500000  1.000000  15000.000000
50%    4.000000  1.500000  17500.000000
75%    5.500000  2.000000  46250.000000
max    7.000000  3.000000  55000.000000


# 3. Remove Missing Values (Data Cleaning)

In [13]:
df_cleaned = df.dropna()
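
Before dropping rows, it can help to see how many values are missing in each column. The sketch below rebuilds the table printed in step 1 by hand (an assumption made so the example runs without the CSV file) and compares row counts before and after `dropna()`.

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the table printed in step 1, so no CSV file is needed.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 7],
    "product": ["Laptop", "Mobile", "Tablet", "Laptop",
                "Mobile", "Headphones", "Tablet"],
    "quantity": [2, 1, np.nan, 1, 2, 3, 1],
    "price": [55000, 20000, 15000, 55000, np.nan, 2000, 15000],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
df_cleaned = df.dropna()
print(len(df), "->", len(df_cleaned))
```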

# 4. Select Columns & Rows (Indexing)

In [14]:
# Select one column
prices = df["price"]

# Select multiple columns
subset = df[["product", "price"]]

# Filter rows
filtered_data = df[df["price"] > 10000]
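
Row filtering and column selection can also be combined in a single `.loc` call. This is a sketch on a hand-rebuilt copy of the table from step 1 (an assumption, so the example does not depend on the CSV file); `expensive` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the table printed in step 1.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 7],
    "product": ["Laptop", "Mobile", "Tablet", "Laptop",
                "Mobile", "Headphones", "Tablet"],
    "quantity": [2, 1, np.nan, 1, 2, 3, 1],
    "price": [55000, 20000, 15000, 55000, np.nan, 2000, 15000],
})

# .loc[row_condition, column_list]: rows where price > 10000, two columns only.
# Rows with a missing price compare as False and are left out.
expensive = df.loc[df["price"] > 10000, ["product", "price"]]
print(expensive)
```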

# 5. Modify / Create New Columns

In [15]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that a direct column assignment on the result of dropna() can trigger.
df_cleaned = df_cleaned.assign(total_price=df_cleaned["quantity"] * df_cleaned["price"])


# 6. Sort Data

In [16]:
sorted_df = df_cleaned.sort_values(by="price", ascending=False)


# 7. Group & Aggregate Data

In [17]:
average_sales = df_cleaned.groupby("product")["total_price"].mean()
print(average_sales)

product
Headphones     6000.0
Laptop        82500.0
Mobile        20000.0
Tablet        15000.0
Name: total_price, dtype: float64


# 8. Remove Duplicates

In [18]:
df_no_duplicates = df.drop_duplicates()

# 9. Rename Columns

In [19]:
df.rename(columns={"price": "unit_price"}, inplace=True)

# **3. CLIENT PROJECT (Real-World Dataset)**

**Problem Statement**

Clean a real-world hospital dataset by:

- Removing duplicate rows and rows with missing values
- Filtering out invalid ages
- Calculating the average age per department and the average visit cost

# Import Libraries

In [21]:
import numpy as np
import pandas as pd

# 1. Load a Dataset

In [20]:
df = pd.read_csv("/content/hospital_data.csv.txt")
print(df)

   patient_id   department   age  visit_cost
0           1   Cardiology  45.0      5000.0
1           2  Orthopedics  60.0      7000.0
2           3    Neurology   NaN      9000.0
3           4   Cardiology  45.0      5000.0
4           5  Orthopedics -10.0      6500.0
5           6    Neurology  30.0         NaN
6           7   Cardiology  55.0      8000.0
7           2  Orthopedics  60.0      7000.0


# 2: Remove duplicates

In [22]:
df = df.drop_duplicates()
print("\nAfter Removing Duplicates:")
print(df)


After Removing Duplicates:
   patient_id   department   age  visit_cost
0           1   Cardiology  45.0      5000.0
1           2  Orthopedics  60.0      7000.0
2           3    Neurology   NaN      9000.0
3           4   Cardiology  45.0      5000.0
4           5  Orthopedics -10.0      6500.0
5           6    Neurology  30.0         NaN
6           7   Cardiology  55.0      8000.0
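
To see which rows `drop_duplicates()` treats as duplicates, `duplicated()` can be inspected first. The sketch below rebuilds the loaded table by hand (an assumption, so it runs without the file). Only the last row repeats an earlier row in every column; rows 0 and 3 differ in `patient_id`, so both are kept.

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the table printed after loading the dataset.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6, 7, 2],
    "department": ["Cardiology", "Orthopedics", "Neurology", "Cardiology",
                   "Orthopedics", "Neurology", "Cardiology", "Orthopedics"],
    "age": [45, 60, np.nan, 45, -10, 30, 55, 60],
    "visit_cost": [5000, 7000, 9000, 5000, 6500, np.nan, 8000, 7000],
})

# True where a row repeats an earlier row in every column.
dup_mask = df.duplicated()
print(df[dup_mask])

deduped = df.drop_duplicates()
print(len(df), "->", len(deduped))
```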


# 3: Remove rows with missing values

In [23]:
df = df.dropna()
print("\nAfter Removing Missing Values:")
print(df)


After Removing Missing Values:
   patient_id   department   age  visit_cost
0           1   Cardiology  45.0      5000.0
1           2  Orthopedics  60.0      7000.0
3           4   Cardiology  45.0      5000.0
4           5  Orthopedics -10.0      6500.0
6           7   Cardiology  55.0      8000.0


# 4: Filter valid age values (0 to 100)

In [24]:
df = df[(df["age"] >= 0) & (df["age"] <= 100)]
print("\nAfter Filtering Valid Age:")
print(df)


After Filtering Valid Age:
   patient_id   department   age  visit_cost
0           1   Cardiology  45.0      5000.0
1           2  Orthopedics  60.0      7000.0
3           4   Cardiology  45.0      5000.0
6           7   Cardiology  55.0      8000.0
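
The same range check can be written with `Series.between`, which is inclusive on both ends by default. This sketch rebuilds the five rows left after step 3 by hand (an assumption, so it runs on its own) and confirms that both forms select the same rows.

```python
import pandas as pd

# The five rows remaining after step 3, rebuilt by hand with their original index.
df = pd.DataFrame({
    "patient_id": [1, 2, 4, 5, 7],
    "department": ["Cardiology", "Orthopedics", "Cardiology",
                   "Orthopedics", "Cardiology"],
    "age": [45.0, 60.0, 45.0, -10.0, 55.0],
    "visit_cost": [5000.0, 7000.0, 5000.0, 6500.0, 8000.0],
}, index=[0, 1, 3, 4, 6])

# between(0, 100) keeps ages with 0 <= age <= 100.
valid = df[df["age"].between(0, 100)]
same_as_comparison = valid.equals(df[(df["age"] >= 0) & (df["age"] <= 100)])
print(valid)
print(same_as_comparison)
```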


# 5: Calculate average age per department

In [25]:
average_age = df.groupby("department")["age"].mean()
print("\nAverage Age per Department:")
print(average_age)


Average Age per Department:
department
Cardiology     48.333333
Orthopedics    60.000000
Name: age, dtype: float64
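
Orthopedics has only one valid record left at this point, so its average is a single patient's age. Reporting the group size next to the mean makes that visible. A sketch using `agg` on a hand-rebuilt copy of the filtered table (an assumption, so it runs on its own):

```python
import pandas as pd

# The four rows remaining after step 4, rebuilt by hand.
df = pd.DataFrame({
    "patient_id": [1, 2, 4, 7],
    "department": ["Cardiology", "Orthopedics", "Cardiology", "Cardiology"],
    "age": [45.0, 60.0, 45.0, 55.0],
    "visit_cost": [5000.0, 7000.0, 5000.0, 8000.0],
}, index=[0, 1, 3, 6])

# Mean age and number of records per department in one table.
summary = df.groupby("department")["age"].agg(["mean", "count"])
print(summary)
```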


# 6: Calculate average visit cost using NumPy

In [26]:
average_cost = np.mean(df["visit_cost"])
print("\nAverage Visit Cost:", average_cost)


Average Visit Cost: 6250.0
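
The average here is safe because missing values were already dropped in step 3. As a sketch of why that matters: on a plain NumPy array, `np.mean` returns `nan` if any element is missing, while `np.nanmean` skips missing elements. The `with_gap` array below is a hypothetical example, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Visit costs of the four cleaned rows, rebuilt by hand.
costs = pd.Series([5000.0, 7000.0, 5000.0, 8000.0], name="visit_cost")

mean_pandas = costs.mean()
mean_numpy = np.mean(costs.to_numpy())
print(mean_pandas, mean_numpy)

# Hypothetical array with one missing value, for illustration only.
with_gap = np.array([5000.0, np.nan, 8000.0])
plain = np.mean(with_gap)         # nan propagates through the mean
ignoring = np.nanmean(with_gap)   # missing value is skipped
print(plain, ignoring)
```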


In this project, I worked with a real-world hospital dataset containing missing
values, a duplicate row, and an invalid age. I cleaned the dataset using Pandas by
removing the duplicate, dropping rows with missing values, and keeping only ages
between 0 and 100. I then grouped the data to calculate the average patient age
per department and used NumPy to compute the average visit cost. This project
demonstrates real-world data cleaning and aggregation using Python.