# Creating DataFrames from Scratch
_This section covers how to create DataFrames from different data sources, including lists, dictionaries, and external sources._

---

## Contents
1. **Introduction**  
   - Overview of DataFrame creation
   - Importance of structured data

2. **Key Concepts**  
   - Creating DataFrames from dictionaries
   - Using external sources to create DataFrames  

3. **Practical Exercises**  
   - Creating DataFrames from lists and tuples, loading external data, and saving and reloading CSV files

---

## Author
**Author Name:** Juan Alejandro Carrillo Jaimes  

**Contact:** [jalejandrocjaimes@gmail.com](mailto:jalejandrocjaimes@gmail.com) - [Linkedin-AlejoCJaimes31](https://www.linkedin.com/in/alejocjaimes31/)  

**Purpose:** This content was created as an educational resource for university students.

# 1. Introduction
Usually, we create a DataFrame from an existing file or a database, but we can also create one from scratch, for example from parallel lists of data.

## Overview of DataFrame creation
![DataFrame-Python](https://pynative.com/wp-content/uploads/2021/02/pandas-dataframe-from-dictionary.png)

Creating a DataFrame is a fundamental step in data analysis using Pandas. A DataFrame is a two-dimensional labeled data structure, similar to a table in a database or an Excel spreadsheet. It allows data to be manipulated, analyzed, and visualized efficiently.
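
As a quick illustration of the "two-dimensional labeled" description above, the sketch below builds a tiny DataFrame and inspects its shape and labels. The column names and values are made up for this example only.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate rows, columns, and labels
toy = pd.DataFrame({"item": ["pen", "notebook"], "quantity": [3, 5]})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels
print(list(toy.index))    # row labels (a default integer index here)
```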

## Importance of structured data
![str-unstr-data](https://i.ytimg.com/vi/sf2S6ZI9BD0/maxresdefault.jpg)

Structured data is essential for effective data analysis, enabling easy querying, transformation, and visualization. Properly structured DataFrames improve data integrity, reduce errors, and enhance the efficiency of operations such as filtering, aggregation, and merging.

# 2. Key concepts

In [2]:
# import the principal libraries
import pandas as pd # type: ignore
import numpy as np # type: ignore

## 2.1 Creating DataFrames from dictionaries

1. Create parallel lists with your data in them. Each list becomes a column in the DataFrame, so all lists must have the same length, and each list should ideally hold values of a single type.

In [3]:
import random as rand

first_name = ["Sallee","Joseph","Harmonia","Tracie","Liana","Elias","Allsun","Megen","Gaven","Mic"]
last_name  = ["Shovelbottom","Cottem","Scotts","MattiCCI","Reason","Erlam","Fenby","Peirpoint","Lackinton","Aldersea"]
email      = ["sshovelbottom0@china.com.cn","jcottem1@creativecommons.org","hscotts2@comcast.net","tmatticci3@domainmarket.com","lreason4@php.net","eerlam5@cbsnews.com","afenby6@stumbleupon.com","mpeirpoint7@shinystat.com","glackinton8@theglobeandmail.com","maldersea9@un.org"]
gender = [rand.choice(['Male','Female']) for _ in range(10)]
ip_address = ["222.134.225.141","185.200.169.205","163.216.177.34","237.176.132.172","123.184.153.13","78.225.107.21","94.23.76.234","7.42.125.107","236.70.201.94","163.81.57.207"]

2. Create a dictionary from the lists, mapping the column name to the list

In [5]:
# create dictionary
people = {"FirstName": first_name, "LastName": last_name, "Email": email, "Gender": gender, "IpAddress": ip_address}

3. Create a DataFrame from the dictionary

In [6]:
# create DataFrame
df = pd.DataFrame(people)
df

|   | FirstName | LastName | Email | Gender | IpAddress |
|---|---|---|---|---|---|
| 0 | Sallee | Shovelbottom | sshovelbottom0@china.com.cn | Female | 222.134.225.141 |
| 1 | Joseph | Cottem | jcottem1@creativecommons.org | Female | 185.200.169.205 |
| 2 | Harmonia | Scotts | hscotts2@comcast.net | Female | 163.216.177.34 |
| 3 | Tracie | MattiCCI | tmatticci3@domainmarket.com | Female | 237.176.132.172 |
| 4 | Liana | Reason | lreason4@php.net | Female | 123.184.153.13 |
| 5 | Elias | Erlam | eerlam5@cbsnews.com | Female | 78.225.107.21 |
| 6 | Allsun | Fenby | afenby6@stumbleupon.com | Male | 94.23.76.234 |
| 7 | Megen | Peirpoint | mpeirpoint7@shinystat.com | Female | 7.42.125.107 |
| 8 | Gaven | Lackinton | glackinton8@theglobeandmail.com | Female | 236.70.201.94 |
| 9 | Mic | Aldersea | maldersea9@un.org | Female | 163.81.57.207 |
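
The dictionary approach above relies on every list having the same number of elements. A minimal sketch of what happens otherwise, using hypothetical `names` and `ages` lists that are not part of the dataset above:

```python
import pandas as pd

# Hypothetical lists of different lengths (3 names, 2 ages)
names = ["Ana", "Luis", "Marta"]
ages = [23, 31]

try:
    pd.DataFrame({"Name": names, "Age": ages})
    error_message = None
except ValueError as exc:
    # pandas refuses to build the DataFrame when column lengths differ
    error_message = str(exc)

print(error_message)
```

Comparing `len()` of each list before calling the constructor is a simple way to catch this early.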


#### How does it work?
By default, pandas will create a `RangeIndex` for our DataFrame when we call the constructor.

```py
df.index
```
We can specify an index for the DataFrame if we want one. Here the labels are the letters `a` through `j` (character codes 97 to 106):
```py
pd.DataFrame(people, index=[chr(c) for c in range(97,107)])
```

In [7]:
df.index

RangeIndex(start=0, stop=10, step=1)

In [18]:
pd.DataFrame(people, index=[chr(c) for c in range(97,107)])

|   | FirstName | LastName | Email | Gender | IpAddress |
|---|---|---|---|---|---|
| a | Sallee | Shovelbottom | sshovelbottom0@china.com.cn | Female | 222.134.225.141 |
| b | Joseph | Cottem | jcottem1@creativecommons.org | Female | 185.200.169.205 |
| c | Harmonia | Scotts | hscotts2@comcast.net | Female | 163.216.177.34 |
| d | Tracie | MattiCCI | tmatticci3@domainmarket.com | Female | 237.176.132.172 |
| e | Liana | Reason | lreason4@php.net | Female | 123.184.153.13 |
| f | Elias | Erlam | eerlam5@cbsnews.com | Female | 78.225.107.21 |
| g | Allsun | Fenby | afenby6@stumbleupon.com | Male | 94.23.76.234 |
| h | Megen | Peirpoint | mpeirpoint7@shinystat.com | Female | 7.42.125.107 |
| i | Gaven | Lackinton | glackinton8@theglobeandmail.com | Female | 236.70.201.94 |
| j | Mic | Aldersea | maldersea9@un.org | Female | 163.81.57.207 |
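
Once a DataFrame has a custom index, rows can be looked up by label as well as by position. The sketch below uses a three-row subset of the names above. The variable names `sample` and `by_name` are illustrative, and using `FirstName` as an index is just one possible choice.

```python
import pandas as pd

# Three-row subset of the example data, with a custom string index
sample = pd.DataFrame(
    {
        "FirstName": ["Sallee", "Joseph", "Harmonia"],
        "LastName": ["Shovelbottom", "Cottem", "Scotts"],
    },
    index=["a", "b", "c"],
)

row_by_label = sample.loc["b"]    # label-based lookup
row_by_position = sample.iloc[1]  # position-based lookup (second row)
print(row_by_label.equals(row_by_position))

# An existing column can also be promoted to the index after construction
by_name = sample.set_index("FirstName")
print(by_name.loc["Joseph", "LastName"])
```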


## 2.2 Using External Sources to Create DataFrames

1. Creating a DataFrame from a CSV File

In [32]:
covid_19_us_colleges = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/refs/heads/master/colleges/colleges.csv')

In [33]:
covid_19_us_colleges.head()

|   | date | state | county | city | ipeds_id | college | cases | cases_2021 | notes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-05-26 | Alabama | Madison | Huntsville | 100654 | Alabama A&M University | 41 | NaN | NaN |
| 1 | 2021-05-26 | Alabama | Montgomery | Montgomery | 100724 | Alabama State University | 2 | NaN | NaN |
| 2 | 2021-05-26 | Alabama | Limestone | Athens | 100812 | Athens State University | 45 | 10.0 | NaN |
| 3 | 2021-05-26 | Alabama | Lee | Auburn | 100858 | Auburn University | 2742 | 567.0 | NaN |
| 4 | 2021-05-26 | Alabama | Montgomery | Montgomery | 100830 | Auburn University at Montgomery | 220 | 80.0 | NaN |
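
`pd.read_csv()` accepts optional arguments that control which columns are loaded and how they are parsed. The sketch below is one way to try them offline: it reads a few rows copied from the output above out of an in-memory string instead of the URL. The choice of columns to keep is an assumption made for illustration.

```python
import io

import pandas as pd

# A few rows copied from the output above, held in memory so the sketch runs offline
csv_text = """date,state,county,city,ipeds_id,college,cases,cases_2021,notes
2021-05-26,Alabama,Madison,Huntsville,100654,Alabama A&M University,41,,
2021-05-26,Alabama,Limestone,Athens,100812,Athens State University,45,10.0,
2021-05-26,Alabama,Lee,Auburn,100858,Auburn University,2742,567.0,
"""

colleges_sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["date", "state", "college", "cases", "cases_2021"],  # load only these columns
    parse_dates=["date"],  # parse the date column as datetimes instead of text
)

print(colleges_sample.dtypes)
print(colleges_sample["cases_2021"].isna().sum())  # number of missing values
```

Empty fields in the file become missing values, which is why `cases_2021` is read as a floating-point column.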


2. Creating a DataFrame from a list of dictionaries

In [19]:
data = [{
  "id": 1,
  "first_name": "Mandi",
  "last_name": "Purchall",
  "email": "mpurchall0@blog.com",
  "gender": "Female",
  "ip_address": "186.170.193.201"
}, {
  "id": 2,
  "first_name": "Lee",
  "last_name": "Chaudrelle",
  "email": "lchaudrelle1@vk.com",
  "gender": "Female",
  "ip_address": "74.129.110.226"
}, {
  "id": 3,
  "first_name": "Guthrie",
  "last_name": "Roseburgh",
  "email": "groseburgh2@patch.com",
  "gender": "Male",
  "ip_address": "237.144.100.166"
}]

In [34]:
people_dict = pd.DataFrame(data, index=['p1','p2','p3'])
people_dict

|   | id | first_name | last_name | email | gender | ip_address |
|---|---|---|---|---|---|---|
| p1 | 1 | Mandi | Purchall | mpurchall0@blog.com | Female | 186.170.193.201 |
| p2 | 2 | Lee | Chaudrelle | lchaudrelle1@vk.com | Female | 74.129.110.226 |
| p3 | 3 | Guthrie | Roseburgh | groseburgh2@patch.com | Male | 237.144.100.166 |
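
With a list of dictionaries, each dictionary is one row and the keys become column names. The records do not need identical keys. A minimal sketch, using a shortened, hypothetical version of the records above in which the second record has no `email` key:

```python
import pandas as pd

# Hypothetical records: the second one has no "email" key
records = [
    {"id": 1, "first_name": "Mandi", "email": "mpurchall0@blog.com"},
    {"id": 2, "first_name": "Lee"},
]

partial = pd.DataFrame(records)

print(partial)
# The missing key shows up as a missing value in that row
print(partial["email"].isna().tolist())
```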


You can also specify the column order if it matters to you:

In [25]:
cat_data = ['first_name','last_name','gender','email']
cont_data = ['id', 'ip_address']
new_order = (cat_data + cont_data)
people_json_order = pd.DataFrame(data, columns=new_order)
people_json_order

|   | first_name | last_name | gender | email | id | ip_address |
|---|---|---|---|---|---|---|
| 0 | Mandi | Purchall | Female | mpurchall0@blog.com | 1 | 186.170.193.201 |
| 1 | Lee | Chaudrelle | Female | lchaudrelle1@vk.com | 2 | 74.129.110.226 |
| 2 | Guthrie | Roseburgh | Male | groseburgh2@patch.com | 3 | 237.144.100.166 |
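
When the input is a list of dictionaries, the `columns` argument filters as well as orders: keys left out of the list are dropped, and names that do not appear in the data produce all-missing columns. The sketch below uses shortened, hypothetical records and a made-up `phone` column to show this. It also shows one common way to reorder a DataFrame that already exists.

```python
import pandas as pd

# Shortened, hypothetical records
records = [
    {"id": 1, "first_name": "Mandi", "gender": "Female"},
    {"id": 2, "first_name": "Lee", "gender": "Female"},
]

# "id" is omitted, so it is dropped; "phone" is not in the data, so it is all-missing
subset = pd.DataFrame(records, columns=["gender", "first_name", "phone"])
print(subset)

# Reordering an already-built DataFrame by selecting a list of column names
full = pd.DataFrame(records)
reordered = full[["first_name", "gender", "id"]]
print(list(reordered.columns))
```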


# 3. Exercises

---

## **Exercise Set: Creating and Persisting DataFrames**  
### **Dataset: Tracking Covid-19 at U.S. Colleges and Universities**  
📌 *Make sure you have downloaded the dataset before running the exercises.*  

📄 **Source:** [New York Times Covid-19 Tracker](https://raw.githubusercontent.com/nytimes/covid-19-data/refs/heads/master/colleges/colleges.csv)  

---

### **Exercise 1: Creating DataFrames from Lists and Tuples**  
**Objective:** Learn how to create DataFrames manually using lists and tuples.  

##### **Task:**  
1. Create a DataFrame manually using the following list of lists:  
   ```python
   data = [
       ['2021-02-26', 'California', 'Los Angeles', 'Los Angeles', 110635, 'University of California, Los Angeles', 1256, 340],
       ['2021-02-26', 'New York', 'New York', 'New York', 190150, 'New York University', 2301, 450],
       ['2021-02-26', 'Texas', 'Harris', 'Houston', 226152, 'University of Houston', 980, 210]
   ]
   columns = ['date', 'state', 'county', 'city', 'ipeds_id', 'college', 'cases', 'cases_2021']
   ```
   Convert this data into a Pandas DataFrame.  
2. Transform the same dataset into a list of tuples and create a DataFrame from it.  
3. Display the first 3 rows of the DataFrame.  

💡 **Hint:** Use `pd.DataFrame()` with the `columns` argument.  
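
As a warm-up for this exercise, the sketch below shows the general pattern on unrelated toy data. The `item` and `quantity` columns are hypothetical and not part of the Covid-19 dataset. Each inner list or tuple is one row, and `columns` supplies the names.

```python
import pandas as pd

# Hypothetical rows, unrelated to the exercise data
rows = [
    ["pen", 3],
    ["notebook", 5],
]
from_lists = pd.DataFrame(rows, columns=["item", "quantity"])

# The same rows as tuples produce an identical DataFrame
from_tuples = pd.DataFrame([tuple(row) for row in rows], columns=["item", "quantity"])

print(from_lists.equals(from_tuples))
```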

---

### **Exercise 2: Creating DataFrames from an External Source**  
**Objective:** Load data into a DataFrame from an external CSV source.  

##### **Task:**  
1. Load the dataset from the given URL into a Pandas DataFrame.  
2. Display the first 10 rows of the dataset.  
3. Use `.info()` to check the structure of the DataFrame.  

💡 **Hint:** Use `pd.read_csv()`.  

---

### **Exercise 3: Saving and Loading CSV Files**  
**Objective:** Learn how to persist data using CSV format.  

##### **Task:**  
1. Save the DataFrame as `covid_colleges_backup.csv`.  
2. Reload the CSV file into a new DataFrame and confirm it matches the original.  
3. Load only the first 100 rows from the CSV file into a DataFrame.  

💡 **Hint:** Use `df.to_csv()` and `pd.read_csv()`.  
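
The round trip can be sketched on a small stand-in DataFrame before applying it to the real dataset. The data, the `backup.csv` file name, and the temporary directory below are all hypothetical choices for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real dataset
original = pd.DataFrame({"college": ["A", "B", "C"], "cases": [10, 20, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "backup.csv")  # hypothetical file name

    # index=False keeps the row index out of the file, so no "Unnamed: 0" column appears on reload
    original.to_csv(path, index=False)

    reloaded = pd.read_csv(path)
    first_two = pd.read_csv(path, nrows=2)  # read only the first 2 data rows

print(original.equals(reloaded))
print(len(first_two))
```

`DataFrame.equals()` is one way to confirm that the reloaded data matches. It compares values and dtypes, so columns that change type on reload (dates read back as text, for example) would make it return `False`.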

---