Reddy Sai Reddy Duggireddy KJ42777 week - 2 Data Structure and Analysis To UMBC

## Dive into Python: Week 2 Class on Data Structures & Analysis 🐍

This week's class is dedicated to mastering Python's fundamental programming concepts, with a special focus on data structures and analysis techniques.


We'll explore Python's basic data types, such as numbers, strings, and tuples, as well as more complex structures such as lists, sets, and dictionaries. We'll also cover control flow, including for-loops, while-loops, and conditionals, along with list comprehensions for concise data manipulation.
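
As a quick orientation, the sketch below touches each of these structures and loop forms once. All names and values in it are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample values for illustration only
record = ("Logan", "Oklahoma")                 # tuple: immutable pair
daily_counts = [0, 1, 1, 3, 4]                 # list: ordered and mutable
states = {"Oklahoma", "Texas", "Oklahoma"}     # set: duplicates collapse
totals = {"Oklahoma": 4, "Texas": 7}           # dict: key -> value lookup

# for-loop with a conditional: keep only the non-zero counts
non_zero = []
for count in daily_counts:
    if count > 0:
        non_zero.append(count)

# The same filter written as a list comprehension
non_zero_lc = [count for count in daily_counts if count > 0]

# while-loop: index of the first day whose count reaches 3
day = 0
while day < len(daily_counts) and daily_counts[day] < 3:
    day += 1

print(non_zero == non_zero_lc, len(states), day)
```

The comprehension and the explicit loop build the same list; the comprehension is usually preferred when the body is a single expression.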

Application:

We'll apply these concepts to the COVID-19 time-series datasets.

1. **Download and Preprocess Data:**
   - Access CSV files for confirmed cases, deaths, and recoveries from the [CSSE COVID-19 Dataset](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series).
   - Learn to preprocess and format this data using built-in Python libraries.

2. **Analyze without Pandas:**
The goal is to load, preprocess, and analyze the data for deaths and recovered cases, following the example provided for confirmed cases. This involves converting string dates to datetime objects, converting numerical values to integers, and organizing the data for analysis.
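
The two conversions mentioned here can be sketched with the standard library alone. The helper names `parse_header_date` and `to_int` are hypothetical, and the sample header and row values are made up; the sketch assumes the date headers follow the month/day/two-digit-year pattern seen in the file (for example `1/22/20`).

```python
from datetime import datetime

def parse_header_date(text):
    """Parse a header such as '1/22/20' (month/day/two-digit year)."""
    return datetime.strptime(text, "%m/%d/%y")

def to_int(text):
    """Convert a count string to int; blank or non-numeric text becomes 0."""
    try:
        return int(float(text))  # float() first so '12.0' is accepted too
    except (ValueError, OverflowError):
        return 0

# Hypothetical slice of one row: date headers and their raw string values
sample_headers = ["1/22/20", "1/23/20", "1/24/20"]
sample_values = ["0", "", "12.0"]

# Pair each parsed date with its integer count
series = [(parse_header_date(h), to_int(v))
          for h, v in zip(sample_headers, sample_values)]
for date, count in series:
    print(date.date(), count)
```

Keeping the dates as `datetime` objects makes it possible to sort rows chronologically or compute day differences later.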


Instructions:

**Data Retrieval:**

- Use the provided Python code snippet to download the CSV file containing the COVID-19 data.
- Make sure you have the `requests` library installed. If not, install it with `pip install requests`.
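
For the deaths and recovered files, the same download step can be wrapped in a small helper. This is a sketch under two assumptions: that the other files follow the `time_series_covid19_<category>_<scope>.csv` naming pattern in the same repository folder, and that they are served from `raw.githubusercontent.com`. The helper names `raw_csv_url` and `download_csv` are hypothetical, and the sketch uses the standard-library `urllib` instead of `requests` so it has no third-party dependency.

```python
from urllib.request import urlretrieve

# Assumed location of the raw files in the CSSE repository
BASE_URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
            "csse_covid_19_data/csse_covid_19_time_series/")

def raw_csv_url(category, scope):
    """Build the raw download URL, e.g. category='deaths', scope='US'."""
    return f"{BASE_URL}time_series_covid19_{category}_{scope}.csv"

def download_csv(category, scope):
    """Download one file into the working directory and return its local name."""
    filename = f"time_series_covid19_{category}_{scope}.csv"
    urlretrieve(raw_csv_url(category, scope), filename)
    return filename

# Example call (commented out so nothing is downloaded on import):
# download_csv("deaths", "US")
print(raw_csv_url("deaths", "US"))
```

Check the repository folder for the exact file names before downloading; recovered counts appear to be published only as a global file, so `download_csv("recovered", "global")` is the likely call for that category.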

**Understanding the Data:**

- Write a Python script to open the CSV file and read the data.
- Find out the number of rows and columns in the CSV file.
- Print the names of the columns (headers) and the first five rows to understand the data structure.

**Data Preprocessing:**

- Remove any duplicate rows.
- Identify any columns that contain missing values. Write a Python function to fill missing values with a placeholder like 'Unknown' for string data or 0 for numerical data.
- Ensure all numerical columns containing case counts are converted to integer data types.

### Two main tasks 

1. Review the code below.
2. Do the same for deaths and recovered cases.




**Resources:**

Please refer to the [Python Documentation](https://docs.python.org/3/) for additional help on the topics covered.

## Example for Confirmed Cases



In [6]:
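# Data Retrieval
import requests

# Direct link to the raw CSV file. raw.githubusercontent.com serves the file itself;
# the github.com/.../blob/... page would return HTML instead of CSV.
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'
r = requests.get(url)
r.raise_for_status()  # Stop here if the download failed
# Save under the same name as the source file (US data, not global)
with open('time_series_covid19_confirmed_US.csv', 'wb') as f:
    f.write(r.content)
print("File downloaded successfully")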


File downloaded successfully


## 2. Understanding the Data:

In [7]:
import csv

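# Open the CSV file and read the data (adjust the path to where your copy is saved)
filename = 'C:\\Users\\saire\\OneDrive\\Desktop\\time_series_covid19_confirmed_US.csv'
with open(filename, mode='r', newline='') as file:  # newline='' is recommended for the csv module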
    csv_reader = csv.reader(file)
    headers = next(csv_reader)  # This is the header row
    data = list(csv_reader)     # Read the rest of the data

# Print the number of rows and columns
print(f"Number of rows: {len(data)}")
print(f"Number of columns: {len(headers)}")

# Print column names
print("Column names:")
print(headers)

# Print the first five rows of the data
print("\nFirst five rows:")
for row in data[:5]:
    print(row)



Number of rows: 3342
Number of columns: 1154
Column names:
['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Lat', 'Long_', 'Combined_Key', '1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20', '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20', '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20', '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20', '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20', '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20', '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20', '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20', '3/15/20', '3/16/20', '3/17/20', '3/18/20', '3/19/20', '3/20/20', '3/21/20', '3/22/20', '3/23/20', '3/24/20', '3/25/20', '3/26/20', '3/27/20', '3/28/20', '3/29/20', '3/30/20', '3/31/20', '4/1/20', '4/2/20', '4/3/20', '4/4/20', '4/5/20', '4/6/20', '4/7/20', 

## 3. Data Preprocessing:


In [8]:
import csv

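# Open the CSV file and read the data (adjust the path to where your copy is saved)
filename = 'C:\\Users\\saire\\OneDrive\\Desktop\\time_series_covid19_confirmed_US.csv'
with open(filename, mode='r', newline='') as file:  # newline='' is recommended for the csv module
    csv_reader = csv.reader(file)
    headers = next(csv_reader)  # This is the header row
    data = list(csv_reader)

# Remove duplicate rows (rows become tuples so they can be stored in a set).
# Note: a set does not preserve order, so the rows come back in arbitrary order.
unique_data = [list(t) for t in set(tuple(element) for element in data)]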

# Check if a header is a date column
def is_date_column(header):
    parts = header.split('/')
    return len(parts) == 3 and all(part.isdigit() for part in parts)

# Fill missing values with 'Unknown' or 0
def fill_missing_values(row, headers):
    filled_row = []
    for i, value in enumerate(row):
        if is_date_column(headers[i]):
            filled_row.append(int(value) if value.isdigit() else 0)
        elif headers[i] in ['Lat', 'Long_']:  # These are float columns
            filled_row.append(float(value) if value else 0.0)
        else:
            filled_row.append(value if value else 'Unknown')
    return filled_row

# Fill missing values for all data
filled_data = [fill_missing_values(row, headers) for row in unique_data]

print("Data preprocessing complete.")

print("Number of rows in original data:", len(data))
print("Number of rows after removing duplicates:", len(unique_data))
print("Number of rows after preprocessing:", len(filled_data))
print("\nFirst five rows of preprocessed data:")
for row in filled_data[:5]:
    print(row)

Data preprocessing complete.
Number of rows in original data: 3342
Number of rows after removing duplicates: 3342
Number of rows after preprocessing: 3342

First five rows of preprocessed data:
['84040083', 'US', 'USA', '840', '40083.0', 'Logan', 'Oklahoma', 'US', 35.91899576, -97.44352845, 'Logan, Oklahoma, US', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 13, 14, 15, 16, 16, 16, 16, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 21, 21, 21, 21, 22, 22, 23, 23, 25, 24, 24, 25, 26, 26, 28, 30, 33, 35, 37, 39, 41, 43, 44, 49, 49, 52, 54, 55, 60, 62, 67, 70, 75, 85, 90, 98, 98, 101, 111, 117, 114, 119, 120, 120, 123, 128, 130, 138, 143, 147, 150, 159, 164, 164, 170, 180,
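
The first preprocessed row shown above is not the first row of the file because de-duplicating through a `set` discards the original order. If the order matters, one alternative is to de-duplicate through dictionary keys, which keep insertion order in Python 3.7 and later. This is a minimal sketch; `dedupe_keep_order` is a hypothetical helper name and the sample rows are made up.

```python
def dedupe_keep_order(rows):
    """Drop duplicate rows, keeping the first occurrence of each in file order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return [list(key) for key in dict.fromkeys(tuple(row) for row in rows)]

# Hypothetical rows: the third one repeats the first
sample_rows = [["a", "1"], ["b", "2"], ["a", "1"]]
deduped = dedupe_keep_order(sample_rows)
print(deduped)
```
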

## 4. Descriptive Statistics:

In [10]:
from collections import defaultdict

# Sum every date column for each 'Province_State'.
# Caution: this time series is cumulative, so the sum adds each day's running total
# again and greatly overstates the true count; the last date column alone holds a
# state's total confirmed cases.
state_idx = headers.index('Province_State')  # Adjust this if the header is different
date_indices = [i for i, col in enumerate(headers) if is_date_column(col)]
confirmed_cases_by_state = defaultdict(int)
for row in filled_data:
    # Date columns were already converted to int during preprocessing
    confirmed_cases_by_state[row[state_idx]] += sum(row[i] for i in date_indices)

# Find and print the 'Province_State' with the highest total confirmed cases
max_state = max(confirmed_cases_by_state, key=confirmed_cases_by_state.get)
print(f"The state with the most confirmed cases is {max_state} with {confirmed_cases_by_state[max_state]} cases.")

# Find the date with the highest count across all states.
# Because the counts are cumulative, this is almost always the latest date, and its
# value is the running national total rather than the cases reported that day.
date_positions = [(col, i) for i, col in enumerate(headers) if '/' in col]  # Date headers contain '/'
cases_by_date = defaultdict(int)
for row in filled_data:
    for date, i in date_positions:
        cases_by_date[date] += row[i]  # Already an int after preprocessing

max_date = max(cases_by_date, key=cases_by_date.get)
print(f"The date with the highest number of reported cases is {max_date} with {cases_by_date[max_date]} cases.")


The state with the most confirmed cases is California with 6166190335 cases.
The date with the highest number of reported cases is 3/9/23 with 103802702 cases.
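
Because each date column in this dataset holds a running (cumulative) count, a state's total is its value in the most recent date column, and the cases reported on a given day are the difference between consecutive columns. The following is a minimal sketch under that assumption, using a made-up three-row table (the state names `Alpha` and `Beta` and all counts are hypothetical). `latest_totals_by_state` and `daily_new_cases` are illustrative helper names, not part of the class code.

```python
from collections import defaultdict

# Hypothetical miniature table in the same layout as the real file
sample_headers = ["Province_State", "1/22/20", "1/23/20", "1/24/20"]
sample_rows = [
    ["Alpha", 0, 2, 5],   # first county of state Alpha
    ["Alpha", 1, 1, 4],   # second county of state Alpha
    ["Beta", 0, 3, 3],
]

def latest_totals_by_state(headers, rows):
    """Total per state, taken from the most recent cumulative column."""
    state_idx = headers.index("Province_State")
    last_date_idx = [i for i, h in enumerate(headers) if "/" in h][-1]
    totals = defaultdict(int)
    for row in rows:
        totals[row[state_idx]] += row[last_date_idx]
    return dict(totals)

def daily_new_cases(headers, rows):
    """New cases per date: difference between consecutive cumulative totals."""
    date_idx = [i for i, h in enumerate(headers) if "/" in h]
    cumulative = [sum(row[i] for row in rows) for i in date_idx]
    new = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
    return dict(zip((headers[i] for i in date_idx), new))

latest = latest_totals_by_state(sample_headers, sample_rows)
daily = daily_new_cases(sample_headers, sample_rows)
print(latest)
print(max(daily, key=daily.get))  # date with the most newly reported cases
```

On the real data, `filled_data` and `headers` from the preprocessing step could be passed in place of the sample table. Note that corrections in the source can make a cumulative count drop, which would show up as a negative daily value.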
