
Scenario: You need to collect data on housing prices for a real estate project. Dataset: No specific dataset is provided for this question. Instead, focus on the process of acquiring data. Question: Describe a plan to acquire the necessary data from at least three different sources (e.g., web scraping real estate websites, using public APIs for property data, accessing government datasets on property values). Outline the steps you would take to gather the data, handle any inconsistencies or missing values, and combine it into a usable format for analysis. Discuss the challenges and limitations of each data source.

In [1]:
import pandas as pd

# Simulated data from different sources (Web Scraping, APIs, Government Data)
data_zillow = pd.DataFrame({
    'Property_ID': [101, 104],
    'Address': ['123 Main St', '1011 Maple Ln'],
    'City': ['New York', 'Miami'],
    'Price': [550000, 680000],
    'Sq_Ft': [1200, 2000],
    'Bedrooms': [3, 5],
    'Source': ['Zillow', 'Zillow'],
    'Date_Updated': ['2024-02-01', '2024-02-03']
})

data_realtor = pd.DataFrame({
    'Property_ID': [103],
    'Address': ['789 Pine Dr'],
    'City': ['Houston'],
    'Price': [300000],
    'Sq_Ft': [1800],
    'Bedrooms': [3],
    'Source': ['Realtor'],
    'Date_Updated': ['2024-02-05']
})

data_gov = pd.DataFrame({
    'Property_ID': [102, 105],
    'Address': ['456 Oak Ave', '1213 Cedar Rd'],
    'City': ['Chicago', 'Seattle'],
    'Price': [420000, 490000],
    'Sq_Ft': [1500, 1400],
    'Bedrooms': [4, 3],
    'Source': ['Govt API', 'Govt API'],
    'Date_Updated': ['2024-01-20', '2024-01-25']
})

# Convert Date_Updated to datetime so records from all sources sort chronologically
for df in [data_zillow, data_realtor, data_gov]:
    df['Date_Updated'] = pd.to_datetime(df['Date_Updated'])

# Merge all data sources into a single DataFrame
merged_data = pd.concat([data_zillow, data_realtor, data_gov], ignore_index=True)

# Handle missing values (if any)
merged_data.fillna({'Price': merged_data['Price'].median(), 'Sq_Ft': merged_data['Sq_Ft'].median()}, inplace=True)

# Sort by Date_Updated so the most recently updated records come first
merged_data.sort_values(by='Date_Updated', ascending=False, inplace=True)

# Reset index after sorting
merged_data.reset_index(drop=True, inplace=True)

# Display final cleaned dataset
print(merged_data)


   Property_ID        Address      City   Price  Sq_Ft  Bedrooms    Source  \
0          103    789 Pine Dr   Houston  300000   1800         3   Realtor   
1          104  1011 Maple Ln     Miami  680000   2000         5    Zillow   
2          101    123 Main St  New York  550000   1200         3    Zillow   
3          105  1213 Cedar Rd   Seattle  490000   1400         3  Govt API   
4          102    456 Oak Ave   Chicago  420000   1500         4  Govt API   

  Date_Updated  
0   2024-02-05  
1   2024-02-03  
2   2024-02-01  
3   2024-01-25  
4   2024-01-20  


To collect housing price data, we can use three main sources:

1. Web Scraping Real Estate Websites
   - Extracts real-time listings from sites like Zillow or Realtor.com.
   - Challenges: Legal restrictions, IP blocking, and missing data.

2. Public APIs for Property Data
   - APIs from Zillow, Redfin, or government portals provide structured data.
   - Challenges: Limited free access, restricted data fields, and format differences.

3. Government & Open Data Portals
   - Official records on property values, taxes, and sales history.
   - Challenges: Delayed updates, inconsistent formats, and accessibility issues.

Steps to Clean & Use Data:

1. Remove duplicates, fill missing values, and standardize formats.
2. Merge datasets using common property details.
3. Analyze trends, price fluctuations, and investment opportunities.
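
The simulated sources in the cell above already share one schema, which real sources rarely do. The following is a minimal sketch of the standardize and de-duplicate steps, assuming two hypothetical extracts (`scraped` and `api`) whose column names, price formats, and address casing differ. All values and the `Address_Key` helper column are invented for illustration.

```python
import pandas as pd

# Hypothetical raw extracts: column names and formats differ by source.
scraped = pd.DataFrame({
    'address': ['123 Main St', '789 Pine Dr'],
    'list_price': ['$550,000', '$300,000'],  # prices scraped as text
    'updated': ['2024-02-01', '2024-02-05'],
})
api = pd.DataFrame({
    'Address': ['123 MAIN ST', '456 Oak Ave'],
    'Price': [560000, 420000],
    'Date_Updated': ['2024-02-10', '2024-01-20'],
})

# Standardize formats: map source-specific names onto one schema
# and turn the price strings into integers.
scraped = scraped.rename(columns={
    'address': 'Address',
    'list_price': 'Price',
    'updated': 'Date_Updated',
})
scraped['Price'] = scraped['Price'].str.replace(r'[$,]', '', regex=True).astype(int)

# Merge: stack both sources and tag each row with its origin.
combined = pd.concat(
    [scraped.assign(Source='Scrape'), api.assign(Source='API')],
    ignore_index=True,
)
combined['Date_Updated'] = pd.to_datetime(combined['Date_Updated'])

# Normalized key so '123 Main St' and '123 MAIN ST' count as the same property.
combined['Address_Key'] = combined['Address'].str.lower().str.strip()

# Remove duplicates: keep only the most recently updated record per address.
deduped = (
    combined.sort_values('Date_Updated', ascending=False)
    .drop_duplicates(subset='Address_Key', keep='first')
    .reset_index(drop=True)
)
print(deduped[['Address', 'Price', 'Source', 'Date_Updated']])
```

Matching on a lowercased address is a deliberately crude choice. Real address matching usually needs more normalization (abbreviations, unit numbers) or a shared parcel identifier.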

NumPy - Applying (Level 3) Scenario: You're working with audio data represented as a 1D NumPy array. You need to apply a fade-in effect to the beginning of the audio.

Reasoning: This task requires applying a linear fade-in effect over a specified period and then visualizing the impact on the audio signal, testing your understanding of signal manipulation and NumPy arrays.

In [2]:
import numpy as np
import matplotlib.pyplot as plt  # For visualization (optional)

# Create sample audio data (a sine wave for demonstration)
sample_rate = 44100  # Samples per second
duration = 5  # Seconds
time = np.linspace(0, duration, int(sample_rate * duration), False)
audio_data = np.sin(2 * np.pi * 440 * time)  # 440 Hz sine wave
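
The cell above only builds the signal. One possible way to apply the linear fade-in is sketched below, assuming a 1-second fade (the scenario does not fix a length, so `fade_seconds` is an illustrative choice). The names `envelope` and `faded_audio` are not from the original notebook.

```python
import numpy as np

# Recreate the sample signal so this sketch runs on its own.
sample_rate = 44100  # Samples per second
duration = 5  # Seconds
time = np.linspace(0, duration, int(sample_rate * duration), False)
audio_data = np.sin(2 * np.pi * 440 * time)  # 440 Hz sine wave

fade_seconds = 1.0  # Assumed fade length; adjust as needed
fade_samples = int(sample_rate * fade_seconds)

# Gain envelope: ramps linearly from 0 to 1 over the fade, then stays at 1.
envelope = np.ones_like(audio_data)
envelope[:fade_samples] = np.linspace(0.0, 1.0, fade_samples)

# Element-wise multiplication scales each sample by its gain.
faded_audio = audio_data * envelope
```

To see the effect, plot `faded_audio[:fade_samples]` against `time[:fade_samples]` with `plt.plot`. The amplitude should grow steadily until it reaches the level of the original wave.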


NumPy - Analyzing (Level 4) Scenario: You are analyzing time-series data representing stock prices. You need to calculate the moving average of the stock prices to identify trends.

In [None]:
import numpy as np
import pandas as pd

# Create sample stock price data (replace with your actual data)
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100)
stock_prices = 100 + np.cumsum(np.random.randn(100))
data = pd.DataFrame({'Date': dates, 'Price': stock_prices})
prices = data['Price'].to_numpy()  # Get numpy array of prices
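
The cell above stops once `prices` is extracted. A simple moving average can be sketched with `np.convolve`, assuming a 5-day window (the scenario does not specify one, so `window` is an illustrative choice).

```python
import numpy as np

# Recreate the sample prices so this sketch runs on its own.
np.random.seed(42)
prices = 100 + np.cumsum(np.random.randn(100))

window = 5  # Assumed window length in days
kernel = np.ones(window) / window  # Equal weights that sum to 1

# mode='valid' keeps only positions where the full window fits,
# so the result has len(prices) - window + 1 entries.
moving_avg = np.convolve(prices, kernel, mode='valid')

# moving_avg[i] is the mean of prices[i : i + window].
# A rough trend signal: compare the last smoothed value with the first.
trend_up = moving_avg[-1] > moving_avg[0]
print(f"Smoothed start: {moving_avg[0]:.2f}, end: {moving_avg[-1]:.2f}, rising: {trend_up}")
```

A longer window smooths more but reacts more slowly to turning points, so it is worth comparing a short and a long window on the same series.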


Pandas - Applying (Level 3) Scenario: You have a dataset of student grades in different subjects. You need to calculate each student's average grade and assign letter grades based on the average.

In [None]:
import pandas as pd
data = {'StudentID': [1, 2, 3, 4, 5],
        'Math': [85, 92, 78, 88, 95],
        'Science': [90, 88, 85, 92, 80],
        'English': [75, 80, 92, 85, 90]}
df = pd.DataFrame(data)
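
The cell above only builds the DataFrame. The sketch below computes each student's average and assigns a letter grade, assuming a common 90/80/70/60 cutoff scale. The scenario does not define the scale, so the cutoffs and the `Average` and `Letter` column names are illustrative.

```python
import numpy as np
import pandas as pd

data = {'StudentID': [1, 2, 3, 4, 5],
        'Math': [85, 92, 78, 88, 95],
        'Science': [90, 88, 85, 92, 80],
        'English': [75, 80, 92, 85, 90]}
df = pd.DataFrame(data)

# Row-wise mean over the subject columns only (StudentID is excluded).
subjects = ['Math', 'Science', 'English']
df['Average'] = df[subjects].mean(axis=1)

# Assumed grading scale: A >= 90, B >= 80, C >= 70, D >= 60, otherwise F.
# right=False makes each bin include its lower edge, e.g. [80, 90) -> 'B'.
df['Letter'] = pd.cut(
    df['Average'],
    bins=[-np.inf, 60, 70, 80, 90, np.inf],
    labels=['F', 'D', 'C', 'B', 'A'],
    right=False,
)
print(df[['StudentID', 'Average', 'Letter']])
```

With this particular data every average falls between 80 and 90, so all five students receive the same letter. Changing a few scores is an easy way to check the bin edges.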

Pandas - Analyzing (Level 4) Scenario: You are analyzing sales data for different product categories. You need to identify the best-selling categories and analyze their sales trends over time.

In [3]:
import pandas as pd
import numpy as np

np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=90)  # 3 months of data
categories = ['Electronics', 'Clothing', 'Books']
sales = {cat: 100 + np.cumsum(np.random.randint(-10, 20, size=90)) for cat in categories}
df = pd.DataFrame(sales, index=dates)
df = df.resample('W').sum()  # Aggregate weekly sales
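
The cell above prepares weekly totals but leaves the analysis open. The sketch below ranks categories by total sales and looks at week-over-week change and a 4-week rolling mean. The 4-week window and the names `totals`, `weekly_change`, and `trend` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Recreate the weekly sales table so this sketch runs on its own.
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=90)  # 3 months of data
categories = ['Electronics', 'Clothing', 'Books']
sales = {cat: 100 + np.cumsum(np.random.randint(-10, 20, size=90)) for cat in categories}
df = pd.DataFrame(sales, index=dates).resample('W').sum()

# Rank categories by total sales over the whole period.
totals = df.sum().sort_values(ascending=False)
best_category = totals.index[0]
print(totals)
print(f"Best-selling category: {best_category}")

# Week-over-week change in sales; the first week has no predecessor and is dropped.
weekly_change = df.diff().dropna()

# 4-week rolling mean to smooth out week-to-week noise (first 3 rows are NaN).
trend = df.rolling(window=4).mean()
print(trend.tail())
```

The first and last weekly bins can cover fewer than seven days, so their totals are not directly comparable with full weeks. Dropping them before computing changes is a reasonable precaution.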


Scenario: You have collected data on air quality from different sources, including government agencies, research institutions, and citizen science initiatives. Dataset: No specific dataset is provided. The focus is on analyzing the data sources. Question: Analyze the challenges of combining air quality data from these diverse sources. Discuss potential issues related to data accuracy, consistency, measurement methods, and spatial and temporal coverage. How would you validate and harmonize the data before using it for analysis or modeling? What are the ethical considerations involved in using citizen science data for environmental monitoring?

In [None]:
import pandas as pd
import numpy as np

# Create synthetic datasets for three different sources
np.random.seed(42)

# Government data (accurate and consistent)
gov_data = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=10, freq='D'),
    'PM2.5': np.random.normal(50, 10, 10)  # in µg/m³
})

# Research institution data (accurate but with some missing values)
research_data = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=10, freq='D'),
    'PM2.5': np.random.normal(52, 8, 10)  # in µg/m³
})
research_data.loc[3, 'PM2.5'] = np.nan  # Introducing missing value

# Citizen science data (lower accuracy, might have some bias)
citizen_data = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=10, freq='D'),
    'PM2.5': np.random.normal(55, 12, 10)  # in µg/m³
})
citizen_data['PM2.5'] += 5  # Citizen sensors could have calibration bias

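# Merging all dataframes (inner join on Date).
# Rename each source's PM2.5 column first: pandas applies suffixes only when
# column names collide, so a chained merge with suffixes would leave the
# citizen column named plain 'PM2.5' and break the averaging step below.
merged_data = (
    gov_data.rename(columns={'PM2.5': 'PM2.5_gov'})
    .merge(research_data.rename(columns={'PM2.5': 'PM2.5_research'}), on='Date')
    .merge(citizen_data.rename(columns={'PM2.5': 'PM2.5_citizen'}), on='Date')
)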

# Fill missing values using the median of the column
merged_data['PM2.5_research'] = merged_data['PM2.5_research'].fillna(merged_data['PM2.5_research'].median())

# Standardizing PM2.5 values: Convert all sources to the same unit (µg/m³) (already in the same unit in this example)

# Average the PM2.5 readings across sources
merged_data['PM2.5_avg'] = merged_data[['PM2.5_gov', 'PM2.5_research', 'PM2.5_citizen']].mean(axis=1)

# Displaying the cleaned and harmonized data
print(merged_data)


Explanation of Code:

Data Creation: We create three synthetic datasets representing data from government, research institutions, and citizen science. Each dataset has a Date column and a PM2.5 column (representing air quality), and we assume all data is in the unit of µg/m³. For research data, we introduce a missing value at index 3 for demonstration purposes.

Data Merging: We merge the datasets on the Date column. The merge is an inner join, so only dates present in all three sources are kept. Each source's PM2.5 column is labeled by origin (_gov, _research, _citizen) so the readings stay distinguishable in the combined DataFrame.

Handling Missing Data: We handle missing data in the PM2.5_research column using the fillna() method. In this case, we fill the missing value with the median of the available values in that column.

Standardizing Data: In this example, all the data is already in the same unit (µg/m³), but in a real scenario you may need to convert units. For example, gaseous pollutants such as ozone are often reported in ppm or ppb rather than µg/m³. This is a critical step when working with multiple data sources.

Calculating Average: To get a harmonized measurement of PM2.5 from all sources, we calculate the average of the three columns (PM2.5_gov, PM2.5_research, and PM2.5_citizen) using the .mean(axis=1) method. This averages the values row-wise (for each day).

Output: Finally, the cleaned and combined data is printed. This DataFrame includes columns from each data source and an average PM2.5 value.
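
The question also asks how to validate the sources before combining them. One simple approach, sketched below, is to compare citizen readings with a reference source on the same dates, estimate a constant offset, and remove it. The readings are hand-made hypothetical values so the arithmetic is easy to follow, and treating the bias as a constant offset is a simplifying assumption. Real low-cost sensors often need calibration that depends on humidity or concentration.

```python
import pandas as pd

# Hypothetical co-located readings (µg/m³) from a reference monitor and a citizen sensor.
readings = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=5, freq='D'),
    'PM2.5_gov': [50.0, 48.0, 55.0, 60.0, 52.0],
    'PM2.5_citizen': [56.0, 53.0, 61.0, 64.0, 58.0],
})

# Validation: how far does the citizen sensor sit from the reference, on average?
difference = readings['PM2.5_citizen'] - readings['PM2.5_gov']
bias = difference.mean()

# Validation: do the two sources at least move together day to day?
correlation = readings['PM2.5_citizen'].corr(readings['PM2.5_gov'])

# Harmonization: subtract the estimated constant offset from the citizen readings.
readings['PM2.5_citizen_adj'] = readings['PM2.5_citizen'] - bias

print(f"Estimated bias: {bias:.2f} µg/m³, correlation: {correlation:.3f}")
print(readings)
```

A high correlation with a stable offset suggests the sensor tracks real changes and can be corrected. A low correlation would argue for down-weighting or excluding that source rather than averaging it in with equal weight.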