# **Guided Lab - 343.3.2 - Creating a Pandas DataFrame**


**Lab Objective:**

In this lab, we will demonstrate how to create a Pandas DataFrame, a fundamental data structure for data analysis in Python.

**Importance:** Mastering DataFrame creation is crucial for data manipulation, analysis, and visualization in Python. It's the foundation for working with data in Pandas.

**Learning Objective:**

 By the end of this lab, you will be able to create DataFrames using various methods, including dictionaries, lists, and NumPy arrays.

**Prerequisites/Equipment**
- Python environment: a Python IDE or distribution (e.g., Anaconda, Jupyter Notebook).
- NumPy and pandas libraries installed (both can be installed via pip or conda).

**Submission**
- Submit your completed lab using the Start Assignment button on the assignment page in Canvas.
- Your submission can be either of the following:
  - If you are using a notebook, all tasks should be written and submitted in a single notebook file, for example **your_name_labname.ipynb**.
  - If you are using a Python script, all tasks should be written and submitted in a single script file, for example **your_name_labname.py**.
- Add appropriate comments and any additional instructions if required.

**Instructions:**

You can start by importing pandas along with NumPy, which you will use throughout the following examples:
```
import numpy as np
import pandas as pd
```




That’s it. Now you’re ready to create some DataFrames.


**Example 1: Creating a Pandas DataFrame from Dictionaries**

We can create a Pandas DataFrame with a Python dictionary:


```
import numpy as np
import pandas as pd

d = {'x': [1, 2, 3], 'y': [2, 4, 8], 'z': 100}
pd.DataFrame(d)
```

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 2 | 100 |
| 1 | 2 | 4 | 100 |
| 2 | 3 | 8 | 100 |


The keys of the dictionary are the DataFrame’s column labels, and the dictionary values are the data values in the corresponding DataFrame columns.

The values can be contained in a tuple, list, one-dimensional NumPy array, Pandas Series object, or one of several other data types. You can also provide a single value that will be copied along the entire column.
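As a minimal sketch of the point above, the following builds one DataFrame from several value types at once. The column names are hypothetical and chosen only to label each value type; the single value `100` is repeated for every row.

```python
import numpy as np
import pandas as pd

# Each dictionary value uses a different container type (hypothetical column names).
data = {
    "from_list": [1, 2, 3],
    "from_tuple": (2, 4, 8),
    "from_array": np.array([10, 20, 30]),
    "from_series": pd.Series([0.1, 0.2, 0.3]),
    "constant": 100,  # a single value, copied along the entire column
}

df_mixed = pd.DataFrame(data)
print(df_mixed)
print(df_mixed.shape)  # (3, 5)
```

All list-like values must have the same length; only the scalar is stretched to fit.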

You can control the order of the columns with the `columns` parameter and set the row labels with the `index` parameter, as shown in the example below:





```
pd.DataFrame(d, index=[100, 200, 300], columns=['z', 'y', 'x'])
```

|     | z | y | x |
|-----|---|---|---|
| 100 | 100 | 2 | 1 |
| 200 | 100 | 4 | 2 |
| 300 | 100 | 8 | 3 |
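A short sketch of how `index` and `columns` behave with dictionary data, assuming a dictionary like the one above. The label `'w'` is a hypothetical name that is not a key in the dictionary; in this situation pandas fills the column with missing values (`NaN`) instead of raising an error, and keys that are not listed in `columns` are left out.

```python
import pandas as pd

d = {'x': [1, 2, 3], 'y': [2, 4, 8]}

# Keep 'y' and 'x' (in that order) and request 'w', which is not a key in d.
df_sel = pd.DataFrame(d, index=[100, 200, 300], columns=['y', 'x', 'w'])
print(df_sel)

# The row labels come from index, not from the default 0, 1, 2.
print(list(df_sel.index))  # [100, 200, 300]
```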


**Example 2.1: Creating a Pandas DataFrame from lists using zip() function**

We can also use the **`zip()`** function to combine multiple lists into rows, creating a DataFrame with one column per list.


```
import pandas as pd

# Create lists of patient IDs and names
patientID = [101, 23, 48, 49]
name = ['alice', 'bob', 'charlie', 'Eric']
# Create a list of dates of birth (note the mixed string formats)
date_of_birth = ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']
# Create a DataFrame using zip and pd.DataFrame
myDF = pd.DataFrame(zip(patientID, name, date_of_birth), columns=['patientID', 'name', 'date_of_birth'])
myDF
```

|   | patientID | name | date_of_birth |
|---|-----------|------|---------------|
| 0 | 101 | alice | 2023-01-01 |
| 1 | 23 | bob | 2023-01-02 |
| 2 | 48 | charlie | 3/10/2020 143045 |
| 3 | 49 | Eric | 13th of October, 2023 |


**Explanation:**
- `zip(patientID, name, date_of_birth)`: The `zip()` function combines the elements from the three lists into tuples. Each tuple represents a row of data, associating a patient ID, name, and date of birth.
- `pd.DataFrame(...)`: This creates a Pandas DataFrame using the output of `zip()` as the data source.
- `columns=['patientID', 'name', 'date_of_birth']`: This argument sets the column names for the DataFrame.
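To make the row tuples visible, the sketch below materializes the output of `zip()` as a list before passing it to `pd.DataFrame`. The shortened lists are hypothetical; one is deliberately a single element shorter to show that `zip()` stops at the shortest input, which silently drops the unmatched value.

```python
import pandas as pd

patient_ids = [101, 23, 48]
names = ['alice', 'bob']  # one element shorter, on purpose

# zip() pairs elements by position and stops at the shortest list.
rows = list(zip(patient_ids, names))
print(rows)  # [(101, 'alice'), (23, 'bob')]

df_rows = pd.DataFrame(rows, columns=['patientID', 'name'])
print(len(df_rows))  # 2
```

Checking that all lists have the same length before zipping avoids losing rows this way.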

**Example 2.2: Creating a Pandas DataFrame from a List of Dictionaries**

- Another way to create a Pandas DataFrame is to use a **list** of **dictionaries**, where each dictionary describes one row.
- To do this, create the list of dictionaries and pass it to the `pd.DataFrame()` constructor. Optionally, pass a list of strings to the `columns` parameter to choose which keys become columns and in what order.




```
l = [{'x': 1, 'y': 2, 'z': 100},
     {'x': 2, 'y': 4, 'z': 100},
     {'x': 3, 'y': 8, 'z': 100}]

pd.DataFrame(l)
```

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 2 | 100 |
| 1 | 2 | 4 | 100 |
| 2 | 3 | 8 | 100 |


Again, the dictionary keys are the column labels, and the dictionary values are the data values in the DataFrame.
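Because the keys drive the columns, dictionaries in the list do not need identical keys. The sketch below uses hypothetical records in which each row omits one key; the missing cells become `NaN`, and the `columns` parameter selects a subset of keys.

```python
import pandas as pd

# Hypothetical records: the first row has no 'z', the second has no 'y'.
records = [{'x': 1, 'y': 2},
           {'x': 2, 'z': 100}]

df_records = pd.DataFrame(records)
print(df_records)

# Keep only selected keys, in the given order.
df_subset = pd.DataFrame(records, columns=['z', 'x'])
print(df_subset)
```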

You can also use a **nested list**, or a **list of lists**, as the data values. If you do, then it is wise to explicitly specify the labels of columns, rows, or both when you create the DataFrame.


```
l = [[1, 2, 100],
     [2, 4, 100],
     [3, 8, 100]]

pd.DataFrame(l, columns=['x', 'y', 'z'])
```

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 2 | 100 |
| 1 | 2 | 4 | 100 |
| 2 | 3 | 8 | 100 |


That is how you can use a nested list to create a Pandas DataFrame. You can also use a list of tuples in the same way. To do so, just replace the nested lists in the example above with tuples.
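A minimal sketch of the tuple variant, which also shows why explicit labels are recommended: without them, pandas numbers the columns `0, 1, 2`. The row labels `'a'`, `'b'`, `'c'` are hypothetical.

```python
import pandas as pd

rows = [(1, 2, 100),
        (2, 4, 100),
        (3, 8, 100)]

# No labels given: columns default to integer positions.
df_default = pd.DataFrame(rows)
print(list(df_default.columns))  # [0, 1, 2]

# Explicit column and row labels.
df_labeled = pd.DataFrame(rows, columns=['x', 'y', 'z'], index=['a', 'b', 'c'])
print(df_labeled)
```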


**Example 2.3: Creating a Pandas DataFrame using Lists**

```
stocks = ["IBM", "APPLE", "TWTTR", "GE", "MSFT"]
prices = [115.00, 119.14, 19.77, 25.99, 26]

pd.DataFrame(zip(stocks, prices), columns=['stocks', 'prices'])
```


**Example 3: Creating a Pandas DataFrame from NumPy Arrays**

You can pass a two-dimensional NumPy array to the DataFrame constructor the same way you do with a list:

```
# The following line creates a NumPy array named arr.
arr = np.array([[1, 2, 100], [2, 4, 100], [3, 8, 100]])
# The following line creates a Pandas DataFrame named df from arr, with column labels x, y, and z.
df = pd.DataFrame(arr, columns=['x', 'y', 'z'])
df
```


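Although this example looks almost the same as the nested list implementation above, it has one advantage: you can control whether the data is copied with the optional `copy` parameter.

When ***copy*** is ***False***, the data from the NumPy array is not copied, so the DataFrame shares the array's memory. For a two-dimensional NumPy array, this is also what the default setting (`copy=None`) does in pandas 1.3 through 2.x. With Copy-on-Write, which is the default from pandas 3.0, the constructor copies the array unless you pass `copy=False` explicitly. When the data is shared, modifying the array changes your DataFrame too: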




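```
arr = np.array([[1, 2, 100], [2, 4, 100], [3, 8, 100]])
# The following line creates a Pandas DataFrame named df from arr.
# On pandas 3.0 or later, add copy=False to share memory with arr.
df = pd.DataFrame(arr, columns=['x', 'y', 'z'])

arr[0, 0] = 1000
df
```

|   | x | y | z |
|---|---|---|---|
| 0 | 1000 | 2 | 100 |
| 1 | 2 | 4 | 100 |
| 2 | 3 | 8 | 100 |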


Note: Not copying data values can save you a significant amount of time and processing power when working with large datasets.


If this behavior is not what you want, specify `copy=True` in the DataFrame constructor. That way, `df` is created with its own copy of the values from `arr`, and later changes to `arr` do not affect it.
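The effect of `copy=True` can be checked with a small sketch. The variable name `df_copy` is hypothetical; the array is the same one used in the example above.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 100], [2, 4, 100], [3, 8, 100]])

# copy=True: df_copy holds its own data, independent of arr.
df_copy = pd.DataFrame(arr, columns=['x', 'y', 'z'], copy=True)

# Modify the original array after the DataFrame has been built.
arr[0, 0] = 1000

print(arr[0, 0])             # 1000
print(df_copy.loc[0, 'x'])   # 1, the DataFrame is unchanged
```

The trade-off is memory and time: the copy duplicates the array, which matters mostly for large datasets.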


**TJ's Example**

A fully interactive DataFrame maker:

```
import pandas as pd

def parse_value(val):
    try:
        # Try to convert to int
        return int(val)
    except ValueError:
        try:
            # Try to convert to float
            return float(val)
        except ValueError:
            # Leave as string
            return val.strip()

def create_dataframe_interactively():
    data_dict = {}
    while True:
        key = input("Enter column name (or type 'done' to finish): ")
        if key.lower() == 'done':
            break
        values = input(f"Enter values for '{key}' (comma-separated): ")
        parsed_values = [parse_value(v.strip()) for v in values.split(',')]
        data_dict[key] = parsed_values

    return pd.DataFrame(data_dict)

# Example usage
df = create_dataframe_interactively()
print(df)
```

A simpler variant of the same function accepts only whole numbers and converts every value with `int()`:

```
def create_dataframe_interactively():
    data_dict = {}
    while True:
        key = input("Enter column name (or type 'done' to finish): ")
        if key.lower() == 'done':
            break
        values = input(f"Enter values for '{key}' as comma-separated numbers (e.g., 1,2,3): ")
        data_dict[key] = [int(v.strip()) for v in values.split(',')]

    return pd.DataFrame(data_dict)

# Example usage
df = create_dataframe_interactively()
print(df)
```


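A sample run of the first version, which accepts text as well as numbers:

|   | Name | Age |
|---|------|-----|
| 0 | TJ | 20 |
| 1 | Carlton | 30 |
| 2 | AJ | 49 |

The integer-only variant fails when a value is empty (for example, when nothing is entered or the input ends with a comma), because `int('')` is not a valid conversion:

`ValueError: invalid literal for int() with base 10: ''`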
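One way to avoid the `ValueError` shown above is to skip empty entries before converting them. The sketch below is a non-interactive variation, not the lab's own solution: `dataframe_from_text` is a hypothetical helper that takes the typed text as a dictionary, so it can be run without `input()`. It also checks that every column has the same number of values, since `pd.DataFrame` would otherwise raise an error of its own.

```python
import pandas as pd

def parse_value(text):
    """Convert text to int, then float, falling back to the stripped string."""
    text = text.strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return text

def dataframe_from_text(columns):
    """Build a DataFrame from a {column name: 'v1, v2, ...'} mapping (hypothetical helper)."""
    data = {}
    for name, raw in columns.items():
        # Skip empty entries, such as the one left by a trailing comma.
        data[name] = [parse_value(v) for v in raw.split(',') if v.strip()]

    # All columns must have the same number of values.
    sizes = {name: len(values) for name, values in data.items()}
    if len(set(sizes.values())) > 1:
        raise ValueError(f"columns have different lengths: {sizes}")
    return pd.DataFrame(data)

# The trailing comma after 49 no longer causes an error.
df_people = dataframe_from_text({'Name': 'TJ, Carlton, AJ', 'Age': '20, 30, 49,'})
print(df_people)
```

The same filtering could be applied inside the interactive function by adding `if v.strip()` to its list comprehension.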