# Data Processing Tools Overview

## Key Libraries Used in Our ETL Process

### pandas

**pandas** is a powerful Python library for data manipulation and analysis that provides:
- Data structures like DataFrame for efficient tabular data handling
- Tools for reading and writing data between in-memory data structures and formats like CSV
- Methods for filtering, transforming, and aggregating data
- Built-in data cleaning and preparation capabilities
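
The capabilities listed above can be illustrated with a minimal sketch. The CSV text, column names, and age threshold below are hypothetical stand-ins, not data from this notebook; an in-memory string is used so the example runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file such as "users.csv".
csv_text = """name,age,major
Alice,21,Physics
Bob,,Art
Carla,24,Physics
"""

# Read: parse CSV text into a DataFrame (pd.read_csv also accepts a file path).
df = pd.read_csv(io.StringIO(csv_text))

# Clean: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Filter: keep rows where age is at least 22 (arbitrary threshold).
older = df[df["age"] >= 22]

# Aggregate: mean age per major.
mean_age_by_major = df.groupby("major")["age"].mean()

# Write: serialize back to CSV text (pass a path to write a file instead).
csv_out = older.to_csv(index=False)
```

Each step returns a new object, so `df` keeps all three rows while `older` holds only the filtered ones.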




The **random** module:
- Generates pseudo-random numbers for various distributions
- Provides functions for random sampling and selection
- Enables simulation of stochastic processes
- Supports reproducible randomness through seeds
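
As a small illustration of the points above, the sketch below uses a seeded generator so that repeated runs produce the same draws. The seed value 42 and the short `majors` list are arbitrary choices for the example; the notebook itself does not set a seed.

```python
import random

# Two dedicated generators with the same fixed seed (42 is arbitrary).
rng_a = random.Random(42)
rng_b = random.Random(42)

# The same seed yields the same sequence of draws.
ages_a = rng_a.choices(range(18, 27), k=5)
ages_b = rng_b.choices(range(18, 27), k=5)

# Sampling without replacement from a small population.
majors = ["Data Science", "Engineering", "Biology", "Law"]
picked = rng_a.sample(majors, k=2)

# A draw from a continuous distribution (Gaussian, mean 0, std dev 1).
noise = rng_a.gauss(0, 1)
```

Using a separate `random.Random` instance keeps the seed local to this code; calling `random.seed(...)` would instead seed the shared module-level generator.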

In [4]:
# If you already have a CSV file, load it instead:
# df = pd.read_csv("users.csv")

# Otherwise, generate random sample data
import pandas as pd
import random

names = ["Yashaswi Gurram","Alice Smith", "Bob Johnson", "Carlos Martinez", "Diana Chen", "Ella Brown", "Farhan Khan", "Gabriela Santos", "Hiroto Suzuki", "Isabella Rossi", "James Lee", "Khalid al-Farsi", "Linh Nguyen", "Maria Garcia", "Nisha Patel", "Oscar Müller", "Pierre Dubois", "Qi Wang", "Rania El-Amin", "Sophie Dubois", "Tom Becker", "Yusuf Ozturk"]
ages = random.choices(range(18,27), k=20)
countries = ["India","USA", "Canada", "Mexico", "China", "India", "Brazil", "Japan", "Italy", "Germany", "France", "Saudi Arabia", "Vietnam", "Spain", "UK", "Russia", "Turkey", "Egypt", "South Africa", "Australia", "UAE"]
majors = ["Data Science","Computer Science", "Engineering", "Business", "Biology", "Psychology", "Economics", "Art", "Mathematics", "Physics", "Law"]
interested_roles = ["Data Scientist","Data Scientist", "Software Developer", "Business Analyst", "Researcher", "UX Designer", "Consultant", "Marketing Manager", "Financial Analyst", "Project Manager", "HR Specialist"]
placed_roles = ["Data Engineer","Software Developer", "Business Analyst", "Project Manager", "Consultant", "Marketing Manager", "Researcher", "Financial Analyst", "Data Scientist", "UX Designer", "HR Specialist", None]

students = []
for i in range(20):
    name = names[i]
    age = ages[i]
    country = countries[i]
    major = random.choice(majors)
    interested_role = random.choice(interested_roles)
    placed_role = random.choice(placed_roles)
    students.append({
        "name": name,
        "age": age,
        "country": country,
        "major": major,
        "interested_role": interested_role,
        "role_placed": placed_role
    })

# Save as CSV
students_df = pd.DataFrame(students)
students_df.to_csv("students_sample.csv", index=False)

In [6]:
students_df.head()

|   | name | age | country | major | interested_role | role_placed |
|---|---|---|---|---|---|---|
| 0 | Yashaswi Gurram | 22 | India | Physics | Data Scientist | Data Engineer |
| 1 | Alice Smith | 18 | USA | Art | HR Specialist | Marketing Manager |
| 2 | Bob Johnson | 22 | Canada | Art | Business Analyst | HR Specialist |
| 3 | Carlos Martinez | 25 | Mexico | Economics | Data Scientist | Project Manager |
| 4 | Diana Chen | 24 | China | Psychology | UX Designer | Data Scientist |


In [8]:
# Add an extra column to the DataFrame: random years of experience from 0 to 5

students_df['experience'] = [random.randint(0, 5) for _ in range(len(students_df))]

In [10]:
students_df.head()

|   | name | age | country | major | interested_role | role_placed | experience |
|---|---|---|---|---|---|---|---|
| 0 | Yashaswi Gurram | 22 | India | Physics | Data Scientist | Data Engineer | 4 |
| 1 | Alice Smith | 18 | USA | Art | HR Specialist | Marketing Manager | 3 |
| 2 | Bob Johnson | 22 | Canada | Art | Business Analyst | HR Specialist | 1 |
| 3 | Carlos Martinez | 25 | Mexico | Economics | Data Scientist | Project Manager | 2 |
| 4 | Diana Chen | 24 | China | Psychology | UX Designer | Data Scientist | 5 |


In [20]:
# Let's try functions today. The general form is:
"""
def function_name(parameters):
    function body
    return value
"""
# Label the experience as entry_level, Junior, or Senior

def experience_category(experience):
    """Map years of experience to a category label."""
    if experience >= 5:
        return "Senior"
    elif experience >= 3:  # experience < 5 is already guaranteed here
        return "Junior"
    else:
        return "entry_level"
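
One way to check the thresholds is to evaluate the function at every value the notebook can generate (0 through 5). The sketch repeats the function definition so that it runs on its own; the `labels` dictionary is a name introduced only for this illustration.

```python
def experience_category(experience):
    """Map years of experience to a label (same thresholds as the cell above)."""
    if experience >= 5:
        return "Senior"
    elif experience >= 3:
        return "Junior"
    else:
        return "entry_level"

# Evaluate the function for each possible value of the experience column.
labels = {years: experience_category(years) for years in range(6)}
print(labels)
```

Values 0 through 2 map to `entry_level`, 3 and 4 to `Junior`, and 5 to `Senior`.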


In [22]:
# Series.apply() calls a function on each value of a single column.
# DataFrame.apply() applies a function along an axis (rows or columns) of a DataFrame.

students_df["experience_category"] = students_df["experience"].apply(experience_category)
experience_df = students_df[students_df["experience_category"] != "entry_level"]
experience_df.to_csv("experience.csv", index=False)
print(experience_df)

               name  age       country         major   interested_role  \
0   Yashaswi Gurram   22         India       Physics    Data Scientist   
1       Alice Smith   18           USA           Art     HR Specialist   
4        Diana Chen   24         China    Psychology       UX Designer   
6       Farhan Khan   18        Brazil       Biology       UX Designer   
8     Hiroto Suzuki   23         Italy   Engineering       UX Designer   
10        James Lee   20        France           Law        Researcher   
11  Khalid al-Farsi   20  Saudi Arabia           Art        Consultant   
12      Linh Nguyen   23       Vietnam  Data Science  Business Analyst   
15     Oscar Müller   26        Russia           Law     HR Specialist   

          role_placed  experience experience_category  
0       Data Engineer           4              Junior  
1   Marketing Manager           3              Junior  
4      Data Scientist           5              Senior  
6      Data Scientist           5  
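
The difference between applying a function to one column and applying it across rows can be seen in a short sketch, together with the boolean-mask filtering used in the cell above. The two-row frame and the `start_age` and `has_experience` columns are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical two-row frame standing in for students_df.
demo_df = pd.DataFrame({"age": [18, 22], "experience": [0, 4]})

# DataFrame.apply with axis=1: the function receives one row (a Series) at a time.
demo_df["start_age"] = demo_df.apply(
    lambda row: row["age"] - row["experience"], axis=1
)

# Series.apply: the function receives one scalar value at a time.
demo_df["has_experience"] = demo_df["experience"].apply(lambda years: years > 0)

# Boolean mask: keep only the rows where the condition is True.
experienced = demo_df[demo_df["has_experience"]]
```

Filtering with a mask returns a new DataFrame, so `demo_df` still contains both rows afterwards.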

## What We Accomplished:

1. **Created Reusable Functions**
   - Developed modular, reusable code components
   - Improved maintainability and readability

2. **Applied Expert Data Filtering**
   - Categorized records by experience level and filtered out entry-level rows

3. **Exported Processed Data**
   - Saved transformed data to new CSV files
   - Preserved original data while creating clean versions

## Process Overview

We completed a mini ETL (Extract, Transform, Load) pipeline:
- **Extract**: Obtained the source data (generated here as a random sample; `pd.read_csv` would read it from a file)
- **Transform**: Applied filtering and processing functions
- **Load**: Saved results to new CSV files
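
The three stages can be sketched as separate functions. This is a minimal illustration under stated assumptions, not the notebook's exact code: the `extract`, `transform`, and `load` helpers are hypothetical names, the three-row input is a stand-in for the generated student data, and files are written to a temporary directory so nothing is overwritten.

```python
import os
import tempfile
import pandas as pd

def extract(path):
    """Extract: read the source CSV into a DataFrame."""
    return pd.read_csv(path)

def transform(df):
    """Transform: label experience and drop entry-level rows."""
    out = df.copy()  # leave the input frame untouched
    out["experience_category"] = out["experience"].apply(
        lambda years: "Senior" if years >= 5 else "Junior" if years >= 3 else "entry_level"
    )
    return out[out["experience_category"] != "entry_level"]

def load(df, path):
    """Load: write the result to a new CSV file."""
    df.to_csv(path, index=False)

# Hypothetical file locations inside a temporary directory.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "students_sample.csv")
target_path = os.path.join(workdir, "experience.csv")

# Small stand-in for the generated student data.
pd.DataFrame({"name": ["A", "B", "C"], "experience": [1, 4, 5]}).to_csv(
    source_path, index=False
)

# Run the pipeline end to end, then read the result back for inspection.
load(transform(extract(source_path)), target_path)
result = pd.read_csv(target_path)
```

Keeping each stage in its own function makes it possible to test the transformation without touching the file system, and the source file is left unchanged by the run.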
