### Definition and Examples of Data Collection and Data Cleaning

#### Data Collection
**Definition:**
Data Collection is the process of gathering information from various sources so that it can be analyzed and used to make informed decisions. Common methods include surveys, sensors, logs, web scraping, and APIs.


**Real-World Example using Python:**
Suppose we want to collect data on daily weather conditions for a particular city using an API.

**Example:**
Let's use the OpenWeatherMap API to collect weather data.


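In [None]:
import requests

# API key for OpenWeatherMap (replace the placeholder with your own key)
api_key = 'your_api_key'
city = 'London'
url = "https://api.openweathermap.org/data/2.5/weather"

# Pass query parameters separately so requests URL-encodes them
response = requests.get(url, params={"q": city, "appid": api_key}, timeout=10)
response.raise_for_status()  # raise an error on a 4xx/5xx response
data = response.json()

print(data)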


This script collects weather data for London by making an API request and printing the JSON response.
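The raw JSON is nested, so a common next step is to keep only the fields needed for analysis. The sketch below is one way to do that, not the only one. It assumes the response follows the layout of OpenWeatherMap's current-weather payload (`dt`, `main.temp`, `main.humidity`, `wind.speed`); the helper name `to_record` and the sample values are made up for illustration. The output keys mirror the `date`, `temperature`, `humidity`, `wind_speed` columns of `weather_data.csv`.

```python
from datetime import datetime, timezone


def to_record(payload: dict) -> dict:
    """Flatten a current-weather payload into one tabular row.

    Assumes an OpenWeatherMap-style layout. Missing keys become None
    so that gaps can be handled later during cleaning.
    """
    main = payload.get("main", {})
    wind = payload.get("wind", {})
    timestamp = payload.get("dt")  # Unix time in seconds (assumed)
    return {
        "date": (
            datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
            if timestamp is not None
            else None
        ),
        "temperature": main.get("temp"),  # Kelvin unless units=metric is requested
        "humidity": main.get("humidity"),
        "wind_speed": wind.get("speed"),
    }


# Hand-written sample payload (illustrative values, not real measurements)
sample = {
    "dt": 1700000000,
    "main": {"temp": 283.15, "humidity": 81},
    "wind": {"speed": 4.1},
}
record = to_record(sample)
print(record)
```

Returning `None` for absent fields, rather than raising, keeps collection running when a response is incomplete and defers the decision about gaps to the cleaning step.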

#### Data Cleaning
**Definition:**
Data Cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. It involves handling missing values, removing duplicates, correcting errors, and ensuring the data is in a consistent format.
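Detection usually comes before correction: it helps to count what is wrong before deciding how to fix it. The following is a minimal sketch of such an audit, using a small hand-made DataFrame (the values are illustrative, not real observations) in place of a real dataset.

```python
import pandas as pd

nan = float("nan")

# Small hand-made frame standing in for real data (illustrative values)
df = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-04"],
        "temperature": [5.0, nan, 6.5, 7.0, 7.0],
        "humidity": [80.0, 75.0, nan, 70.0, 70.0],
    }
)

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```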

**Real-World Example using Python:**
Let's use a dataset on weather conditions that might have missing or incorrect values.


**Example:**
Suppose we have a CSV file named `weather_data.csv` with the columns `date`, `temperature`, `humidity`, and `wind_speed`.


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('weather_data.csv')

# Display the first few rows
print("Original Data:")
print(df.head())

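# Handle missing values
# Fill missing temperature values with the mean temperature.
# Assigning the result back avoids calling fillna(inplace=True) on a
# selected column, which recent pandas versions warn about.
df['temperature'] = df['temperature'].fillna(df['temperature'].mean())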

# Drop rows where humidity is missing
df.dropna(subset=['humidity'], inplace=True)

# Correct data types if needed
df['date'] = pd.to_datetime(df['date'])

# Remove duplicates
df.drop_duplicates(inplace=True)

# Display cleaned data
print("Cleaned Data:")
print(df.head())


This script loads a weather dataset, fills missing temperature values with the mean temperature, drops rows with missing humidity values, converts the `date` column to a datetime type, removes duplicate rows, and prints the cleaned dataset.
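Values that are present but implausible, such as a humidity reading above 100%, are not caught by the missing-value and duplicate checks. One possible approach, sketched here with illustrative data, is to mark out-of-range readings as missing so they can then be filled or dropped like any other gap. The 0 to 100 bounds are an assumption about relative humidity expressed in percent.

```python
import pandas as pd

# Illustrative readings; 140 and -5 are deliberately impossible humidity values
df = pd.DataFrame(
    {
        "temperature": [12.0, 13.5, 11.0, 14.0],
        "humidity": [65.0, 140.0, -5.0, 70.0],
    }
)

# Assumed valid range for relative humidity, in percent (bounds inclusive)
valid_humidity = df["humidity"].between(0, 100)

# Keep valid readings and replace the rest with NaN so they count as missing
df["humidity"] = df["humidity"].where(valid_humidity)

invalid_count = int((~valid_humidity).sum())
print(df)
print("Invalid humidity readings:", invalid_count)
```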

### Summary
- **Data Collection:** Gathering information from various sources. Example: Collecting weather data using the OpenWeatherMap API.
- **Data Cleaning:** Ensuring the collected data is accurate and usable. Example: Cleaning a CSV file by handling missing values and removing duplicates.

### Data Sources: Internal vs. External Formats with Examples

#### Internal Data Sources
**Internal data sources** refer to data that originates from within an organization, typically generated by its business operations and internal systems. Examples include customer databases, sales records, and financial reports.


##### Example 1: CSV
CSV files are commonly used for storing tabular data. Here's how to work with an internal CSV file in Python.


In [None]:
import pandas as pd

# Load internal CSV file
df_internal_csv = pd.read_csv('internal_sales_data.csv')

# Display the first few rows
print("Internal CSV Data:")
print(df_internal_csv.head())

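The call above relies on the `read_csv` defaults, which infer column types and leave dates as plain strings. A minimal sketch of stating the types up front is shown below. The layout of `internal_sales_data.csv` is not given, so the columns `order_date`, `region`, and `amount` are hypothetical, and an in-memory string stands in for the file.

```python
import io

import pandas as pd

# In-memory stand-in for a sales file; columns and values are hypothetical
csv_text = """order_date,region,amount
2024-01-05,North,120.50
2024-01-06,South,89.99
"""

df_sales = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["order_date"],  # parse this column as datetime
    dtype={"region": "string"},  # use the pandas string dtype instead of object
)

print(df_sales.dtypes)
```

Declaring types at load time surfaces malformed values early instead of letting them pass silently into later analysis.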

##### Example 2: SQL
Many organizations use SQL databases to store internal data. Here's an example of querying an internal SQL database.


In [None]:
import sqlite3

import pandas as pd

# Connect to the internal SQLite database
conn = sqlite3.connect('internal_database.db')

# Query data from the internal database
query = "SELECT * FROM sales"
df_internal_sql = pd.read_sql_query(query, conn)

# Display the first few rows
print("Internal SQL Data:")
print(df_internal_sql.head())

# Close the connection
conn.close()

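The query above selects every row. When a filter value comes from user input or another variable, binding it as a parameter is generally safer than formatting it into the SQL string. The sketch below illustrates this with an in-memory SQLite database so that it runs without an existing file; the `region` and `amount` columns and the rows are hypothetical.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# In-memory database so the sketch runs without an existing file;
# the table layout and rows are hypothetical
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (?, ?)",
        [("North", 120.5), ("South", 89.99), ("North", 42.0)],
    )

    # The ? placeholder lets the driver bind the value instead of
    # splicing it into the SQL text
    query = "SELECT region, amount FROM sales WHERE region = ?"
    df_north = pd.read_sql_query(query, conn, params=("North",))

print(df_north)
```

Wrapping the connection in `closing` ensures it is closed even if the query raises an error.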

#### External Data Sources
**External data sources** refer to data that originates from outside the organization. This data is often obtained from third-party providers, public datasets, or APIs.


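##### Example 1: JSON
JSON is a common format for data published by external services. Here's an example of fetching external JSON data.

```python
import requests

# URL of the external JSON data
url = 'https://api.exchangerate-api.com/v4/latest/USD'

# Fetch the JSON data; fail fast on timeouts and HTTP errors
response = requests.get(url, timeout=10)
response.raise_for_status()
data_external_json = response.json()

# Display the JSON data
print("External JSON Data:")
print(data_external_json)
```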
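Nested JSON is often easier to analyze once it is in tabular form. The sketch below shows one way to do that, assuming the response contains a `base` currency code and a `rates` mapping from currency code to rate. That layout is an assumption about this endpoint, and the sample values are illustrative, not real quotes.

```python
import pandas as pd

# Hand-written stand-in for the response; the keys follow the assumed layout
# and the rates are illustrative, not real quotes
sample = {
    "base": "USD",
    "rates": {"USD": 1.0, "EUR": 0.9, "GBP": 0.8},
}

# Turn the mapping into a two-column table: currency, rate
df_rates = (
    pd.Series(sample["rates"], name="rate")
    .rename_axis("currency")
    .reset_index()
)
df_rates["base"] = sample["base"]

print(df_rates)
```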

##### Example 2: APIs
APIs are a common way to fetch external data. Here's an example using an external API.

```python
import requests

# URL of the external API (replace your_api_key with a real key)
api_url = 'https://api.openweathermap.org/data/2.5/weather?q=London&appid=your_api_key'

# Fetch the data from the API; fail fast on timeouts and HTTP errors
response = requests.get(api_url, timeout=10)
response.raise_for_status()
data_external_api = response.json()

# Display the API data
print("External API Data:")
print(data_external_api)
```


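Embedding the key and city directly in the URL string works for a demo, but it makes the key easy to leak and a city name containing spaces easy to mis-encode. A minimal sketch of one alternative follows, assuming the key is stored in an environment variable. The variable name `OPENWEATHER_API_KEY` and the helper `build_weather_url` are hypothetical, and the `units` parameter is included on the assumption that metric output is wanted.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city: str, api_key: str, units: str = "metric") -> str:
    """Return the request URL with URL-encoded query parameters."""
    query = urlencode({"q": city, "appid": api_key, "units": units})
    return f"{BASE_URL}?{query}"


# OPENWEATHER_API_KEY is a hypothetical variable name; the fallback is a
# placeholder that the API would reject
api_key = os.environ.get("OPENWEATHER_API_KEY", "your_api_key")
api_url = build_weather_url("New York", api_key)

# api_url can now be passed to requests.get(api_url, timeout=10)
```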

### Summary
- **Internal Data Sources:**
  - CSV: Example of loading and displaying data from an internal CSV file.
  - SQL: Example of querying and displaying data from an internal SQL database.

- **External Data Sources:**
  - JSON: Example of fetching and displaying data from an external JSON endpoint.
  - APIs: Example of fetching and displaying data from an external API.

Each example shows how to load or query data from a different source in Python and display the result. The internal examples read from files and databases that the organization controls, while the external examples fetch data over HTTP from third-party services.
