<a href="https://colab.research.google.com/github/Tealexkay/Midterm-project/blob/main/Day2_basics_managing_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Day 2: Managing Data



## 1. What is data? What is a dataset?
Data can be thought of as **information** stored in some format that can be used for **analysis** or other computational tasks. A **dataset** is simply a collection of such data.

Examples:
- A table of users with columns such as name, age, and country.
- A CSV file containing all the transactions for an online store.
- A JSON file containing responses from an API.

### Why do we care?
- We use data to find **patterns**, **insights**, and **answer questions**.
- Data is the foundation of **data analysis**, **machine learning**, **data science**, and many business **decision-making** processes.


## 2. Formats for data
Data can be stored in many different file formats, each with its own advantages and disadvantages:

- **CSV (Comma-Separated Values):** Simple, plain-text files. Easy to read and widely used.
- **JSON (JavaScript Object Notation):** Common for web APIs, easy to parse, can store nested data.
- **XLSX (Excel spreadsheets):** Common in business settings, can contain multiple sheets.
- **Plain text (TXT):** Flexible but usually less structured.
- Others: **Parquet**, **HDF5**, **SQL databases**, etc.

Knowing how data is stored and how to read it is crucial for data analysis.
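To make the difference between formats concrete, here is a minimal sketch that stores the same two records as CSV text and as JSON text, then reads both into pandas. The records and variable names are made up for illustration and are not part of the course data.

```python
import io
import json

import pandas as pd

# The same two (made-up) records written in two different formats.
csv_text = "name,age,country\nAlice,24,US\nBob,30,CA\n"
json_text = (
    '[{"name": "Alice", "age": 24, "country": "US"},'
    ' {"name": "Bob", "age": 30, "country": "CA"}]'
)

# pd.read_csv accepts a file path or a file-like object; StringIO wraps the text.
from_csv = pd.read_csv(io.StringIO(csv_text))

# json.loads turns the JSON text into a list of dictionaries,
# which pandas can convert into a DataFrame.
from_json = pd.DataFrame(json.loads(json_text))

print(from_csv)
print(from_json)
```

Both routes end in a DataFrame with the same columns, which is why the storage format matters mostly at loading time.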

## 3. Where to find data online?
If you don’t have your own data, you can:

- Use **open data portals** (e.g., [data.gov](https://www.data.gov/), [Kaggle Datasets](https://www.kaggle.com/datasets), etc.).
- Download data from **public repositories** on GitHub.
- Use **APIs** to query data directly (e.g., Twitter API, OpenWeatherMap, etc.).
- Create your own data by **scraping websites**, or by collecting logs from an application.


## 4. What are Libraries (Packages)?

In **Python**, a **library** is a collection of modules and functions that provide reusable code to perform common tasks. Libraries help developers save time by offering ready-made solutions for various purposes, such as mathematical operations, data analysis, web development, and more.

Using libraries allows you to focus on solving your specific problem instead of writing everything from scratch. For instance, instead of manually writing code to handle large datasets, you can use a library like **Pandas** to simplify the process.


NOTE: The terms *library* and *package* have distinct technical meanings, but in casual usage they are often used interchangeably.

### Examples of Popular Python Libraries

- **NumPy**: For numerical computations
- **Pandas**: For data analysis and manipulation
- **Matplotlib**: For data visualization
- **Seaborn**: For statistical data visualization

## How to Load a Library

To load a library in Python, you use the `import` statement. Here are different ways to import and use libraries:

### Importing the Entire Library

The simplest way to use a library is to import it entirely:

In [None]:
import pandas
import numpy

### Importing with an Alias

You can import a library with an **alias** to make your code more concise and readable, especially if the library name is long. This is useful when a library is used frequently throughout your code.

To give an alias to a library, you use the `as` keyword:

In [None]:
import pandas as pd
import numpy as np
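As a small illustration of why aliases are convenient, the sketch below calls functions through the `np` and `pd` prefixes. It also shows a third import style, `from ... import ...`, which brings in a single name; the values used are arbitrary.

```python
import numpy as np
import pandas as pd
from math import sqrt  # import one function instead of the whole module

values = np.array([1, 4, 9])
series = pd.Series(values)

print(np.sqrt(values))  # element-wise square root, called through the alias np
print(series.mean())    # mean of the values, computed by pandas
print(sqrt(16))         # imported directly, so no prefix is needed
```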

NOTE: You can install packages that are not pre-installed in Google Colab by running `!pip install package_name` in a code cell. If you run your notebooks locally in an IDE such as VS Code, you may need to install the required packages beforehand.



## 5. Using pandas to load data (Intro to DataFrames)
A **DataFrame** is a 2-dimensional labeled data structure with rows and columns, similar to a spreadsheet.

### Example of loading a dataset
Let's create a small dictionary **on the fly** and load it using **pandas**:


In [None]:
import pandas as pd

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 22, 25, 28],
    'Score': [88, 92, 85, 91, 95]
}

# Convert to DataFrame
df = pd.DataFrame(data)
df
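Once a DataFrame exists, a few attributes give a quick overview of it. The sketch below rebuilds the same sample data so that it runs on its own, then checks its size, its column labels, and the mean of one column.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 22, 25, 28],
    'Score': [88, 92, 85, 91, 95]
}
df = pd.DataFrame(data)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df['Age'].mean())  # a single column can be summarized directly
```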

## 6. How to use data in Google Colab?
There are multiple ways to bring your data into Colab:

1. **Upload directly** to Colab.
2. **Connect to Google Drive**:
   - Mount your Drive and read the file as if it’s in your local filesystem.
3. **Use a raw link from GitHub**:
   - If the data is hosted in a public repo, you can obtain a **raw file link** to read directly.
4. **Using APIs** (not covered in this class), you can retrieve data using `requests` or other libraries.
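A raw file link differs from the address of the file's regular GitHub page. The sketch below shows the usual pattern for turning one into the other. The helper name `to_raw_github_url` and the repository path are hypothetical, and clicking the "Raw" button on the file's GitHub page is the reliable way to get the link.

```python
def to_raw_github_url(page_url: str) -> str:
    """Convert a github.com 'blob' page URL into a raw.githubusercontent.com URL."""
    raw = page_url.replace(
        "https://github.com/", "https://raw.githubusercontent.com/", 1
    )
    # The raw URL has no '/blob/' segment between the repo name and the branch.
    return raw.replace("/blob/", "/", 1)


# Hypothetical repository path, used only to illustrate the pattern.
page_url = "https://github.com/some-user/some-repo/blob/main/data/example.csv"
raw_url = to_raw_github_url(page_url)
print(raw_url)
```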


### Example: Connect to Google Drive
Below is how you could connect your notebook to your Google Drive in Colab:


In [None]:
# Run the following in Google Colab:
from google.colab import drive
drive.mount('/content/drive/')

# After mounting, you can access files in 'My Drive' under '/content/drive/MyDrive'

In [None]:
# Example: importing data from my Google Drive (update the path to match the file's location in your own Drive)
import pandas as pd
weather_drive = pd.read_csv("/content/drive/MyDrive/Lehman College Spring 2025/MAT 301/datasets/KNYC.csv")
weather_drive.head()

### Example: Using a raw link from GitHub
You can pass a **raw GitHub URL** directly to pandas (or to the `requests` library) to read the data:


In [None]:
import pandas as pd

# Example CSV file hosted on GitHub (raw link)
url = "https://raw.githubusercontent.com/liger1apwm/MAT-301_Applied_Stats_Data_Analysis/refs/heads/main/data/KNYC.csv"
weather_github = pd.read_csv(url)
weather_github.head()

this way is preferred if

### Example: Reading a .xlsx file

In [None]:
import pandas as pd

# Replace with your actual file path
file_path = "/content/drive/MyDrive/Lehman College Spring 2025/MAT 301/datasets/Canada.xlsx"

# Read all sheets of the Excel file; sheet_name=None returns a dictionary of DataFrames keyed by sheet name
excel_df = pd.read_excel(file_path, sheet_name=None)


excel_df

To read a specific sheet and skip rows at the top of it, we can do the following:

In [None]:
excel_df_sheet = pd.read_excel(file_path, sheet_name="Canada by Citizenship", skiprows=1)


excel_df_sheet

## 7. Exploring Data
### 7.1 Checking the beginning, end, or random rows
- `df.head()` shows the **first 5** rows.
- `df.tail()` shows the **last 5** rows.
- `df.sample()` shows **one random** row by default.

Each of these functions accepts an integer `n` that controls how many rows are displayed:
- `df.head(n)` returns the first `n` rows if `n` is positive, and all rows except the last `|n|` rows if `n` is negative.
- `df.tail(n)` returns the last `n` rows if `n` is positive, and all rows except the first `|n|` rows if `n` is negative.
- `df.sample(n)` returns `n` random rows; `n` cannot be negative.
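A quick way to check how the integer argument behaves is to try it on a tiny DataFrame whose contents make the result obvious. The DataFrame `numbers` below is made up for this purpose.

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(1, 11)})  # ten rows holding 1, 2, ..., 10

print(numbers.head(3)['n'].tolist())   # first 3 rows
print(numbers.head(-3)['n'].tolist())  # all rows except the last 3
print(numbers.tail(3)['n'].tolist())   # last 3 rows
print(numbers.tail(-3)['n'].tolist())  # all rows except the first 3

# random_state makes the random selection reproducible between runs.
print(numbers.sample(3, random_state=0)['n'].tolist())
```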

### 7.2 Checking column types
Use `df.dtypes` to see the data types of all columns.

Sometimes, numeric columns are stored as **strings**. We can convert them by using `pd.to_numeric()` or other appropriate functions.
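As a minimal sketch of that conversion, the made-up `prices` table below stores numbers as text, including one entry that is not a number at all. Passing `errors='coerce'` is one option for handling such entries; it replaces them with NaN instead of raising an error.

```python
import pandas as pd

prices = pd.DataFrame({'price': ['10', '12.5', 'n/a', '7']})
print(prices.dtypes)  # 'price' is stored as text

# errors='coerce' turns values that cannot be parsed into NaN.
prices['price'] = pd.to_numeric(prices['price'], errors='coerce')

print(prices.dtypes)  # 'price' is now a floating-point column
print(prices)
```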

### 7.3 Date columns
It's helpful to store dates as **datetime** objects. We can convert them with `pd.to_datetime()`.

If date columns remain as strings, pandas can’t easily perform date/time operations like extracting the **month**, **year**, or calculating **durations** between dates.
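The sketch below illustrates both operations on a made-up `visits` table: extracting the month from a date, and computing the number of days between two dates. The column names are assumptions chosen for the example.

```python
import pandas as pd

visits = pd.DataFrame({
    'check_in': ['2025-01-30', '2025-02-10'],
    'check_out': ['2025-02-02', '2025-02-15'],
})

# Convert the text columns to datetime objects.
visits['check_in'] = pd.to_datetime(visits['check_in'])
visits['check_out'] = pd.to_datetime(visits['check_out'])

# The .dt accessor exposes date parts; subtracting two dates gives a duration.
visits['month'] = visits['check_in'].dt.month
visits['nights'] = (visits['check_out'] - visits['check_in']).dt.days

print(visits)
```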

In [None]:
# Exploring the DataFrame

print("HEAD:")
display(weather_github.head(2))



In [None]:
print("\nTAIL:")
display(weather_github.tail(2))

In [None]:
print("\nSAMPLE:")
display(weather_github.sample(2))


In [None]:
print("\nDATA TYPES:")
print(weather_github.dtypes)


In [None]:
# Converting Date column to datetime

weather_github['date'] = pd.to_datetime(weather_github['date'])
print("\nDATA TYPES AFTER CONVERSION:")
print(weather_github.dtypes)



In [None]:
# Example: extracting year from Date

weather_github['Year'] = weather_github['date'].dt.year
print("\nHEAD AFTER ADDING YEAR COLUMN:")
display(weather_github.head(2))

## 8. Detecting and Dealing with Missing Values or Duplicates
### 8.1 Missing Values
Missing values appear as **NaN** in pandas. We can detect them using:
- `df.isnull()` or `df.isna()` (they are equivalent in pandas).
- Summarize with `df.isnull().sum()` to see how many missing values each column has.

We can **fill** missing values or **drop** rows containing missing values:
- `df.fillna(value)` to fill.
- `df.dropna()` to drop.
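Here is a small, self-contained sketch of these calls on a made-up `scores` table. Filling the missing score with the column mean is just one possible choice, used here for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Score': [88.0, np.nan, 95.0],
})

print(scores.isna().sum())  # one missing value in each column

# A dictionary lets each column get its own fill value.
filled = scores.fillna({'Name': 'Unknown', 'Score': scores['Score'].mean()})

# dropna() keeps only the rows that have no missing values.
dropped = scores.dropna()

print(filled)
print(dropped)
```

Both `fillna` and `dropna` return a new DataFrame here, so `scores` itself still contains the missing values.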

### 8.2 Duplicates
Duplicate rows can be detected with `df.duplicated()` and removed with `df.drop_duplicates()`.

In [None]:
# Example of missing values
import numpy as np

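# Introduce a missing value
# (if the DataFrame has no 'Name' column yet, this creates one in which every other row is NaN as well)
weather_github.loc[1, 'Name'] = np.nan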
display(weather_github.head())



In [None]:
# Detect missing values
print("Missing values per column:")
print(weather_github.isnull().sum())



In [None]:
# Fill missing names with 'Unknown'
# (use weather_github.dropna() instead to drop the rows that contain missing values)
weather_github['Name'] = weather_github['Name'].fillna('Unknown')
display(weather_github.head())



In [None]:
# Example of duplicate row
weather_github_duplicate = pd.concat([weather_github.iloc[[0]], weather_github], ignore_index=True)

print("\nCheck duplicate row:")

weather_github_duplicate.head()


In [None]:
# Find duplicate rows
duplicates = weather_github_duplicate.duplicated()

# Display duplicate rows
weather_github_duplicate[duplicates]

In [None]:
# Remove duplicates
weather_github_duplicate.drop_duplicates(inplace=True)
print("After dropping duplicates:")
display(weather_github_duplicate.head())

## 9. Selecting Desired Columns
If you only need certain columns from the DataFrame, you can **select** them by name.

```python
df[['date', 'temp']]
```

Example:

In [None]:
# Selecting columns
weather_subset = weather_github[['date', 'actual_mean_temp']]
print("\nSubset of DataFrame (date, actual_mean_temp):")
display(weather_subset.head())
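The number of brackets matters when selecting columns. As a minimal sketch on made-up data, single brackets with one name return a Series, while a list of names inside the brackets returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 22],
    'Score': [88, 92, 85],
})

one_column = df['Age']             # single name -> pandas Series
sub_table = df[['Name', 'Score']]  # list of names -> pandas DataFrame

print(type(one_column).__name__)
print(type(sub_table).__name__)
print(list(sub_table.columns))
```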



## 10. Creating a New Column Based on Another Column
You can create a **new column** by applying an operation or function to an existing column. For example, if you have an Age column, you might convert age to the number of months:

Example:

```python
df['Age_in_Months'] = df['Age'] * 12
```

Or apply a more complex function.
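One way to apply a more complex function is the `.apply()` method of a column, which calls the function once per value. The function `score_to_grade` and its cutoffs below are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [88, 92, 79]})


def score_to_grade(score):
    """Map a numeric score to a letter grade (hypothetical cutoffs)."""
    if score >= 90:
        return 'A'
    if score >= 80:
        return 'B'
    return 'C'


# .apply() runs score_to_grade on every value in the 'Score' column.
df['Grade'] = df['Score'].apply(score_to_grade)
print(df)
```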

Now, let's see an example using the previous weather data:

In [None]:
# Creating a new column
weather_subset['temp_celcius'] = (weather_subset['actual_mean_temp'] - 32) * 5 / 9.0
print("DataFrame with new column 'temp_celcius':")
display(weather_subset.head())

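NOTE: The cell above may trigger a `SettingWithCopyWarning`, because `weather_subset` was selected from `weather_github` and pandas cannot tell whether the assignment is meant to change the original data. Creating the subset with `.copy()` avoids the warning at its source. You can also suppress warnings by running the following code beforehand, but always ensure you understand what you are suppressing, as some warnings can help prevent future errors.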

In [None]:
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')

# Creating a new column
weather_subset['temp_celcius'] = (weather_subset['actual_mean_temp'] - 32) * 5 / 9.0
print("DataFrame with new column 'temp_celcius':")
display(weather_subset.head())
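An alternative to silencing the warning is to make the subset an independent DataFrame with `.copy()` before adding the column. The sketch below uses two made-up rows in place of the real weather file, and the names `weather`, `subset`, and `temp_celsius` are chosen for this example only.

```python
import pandas as pd

# Made-up rows standing in for the weather data.
weather = pd.DataFrame({
    'date': ['2014-07-01', '2014-07-02'],
    'actual_mean_temp': [81, 82],
    'actual_precipitation': [0.0, 0.1],
})

# .copy() makes the subset independent of the original DataFrame,
# so adding a column to it is unambiguous.
subset = weather[['date', 'actual_mean_temp']].copy()
subset['temp_celsius'] = (subset['actual_mean_temp'] - 32) * 5 / 9.0

print(subset)
print(list(weather.columns))  # the original DataFrame is unchanged
```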

## Summary
In **Day 2**, we covered:
1. Understanding data and datasets
2. Common file formats
3. Ways to source data (online, created, or via APIs)
4. Using data in Google Colab (Drive, GitHub)
5. Important Python libraries for data management (pandas, numpy)
6. Loading data into pandas DataFrames
7. Exploring data (head, tail, sample, dtypes)
8. Handling date/time fields
9. Dealing with missing and duplicate data
10. Selecting and creating new columns

In the next sessions, we'll dig deeper into data cleaning, transformation, and visualization!