## Section 1: Foundational Concepts

---



#### 1. **What is the primary purpose of data manipulation in the data science workflow?**


_Data manipulation cleans and structures messy datasets so that they are readable, relevant, and usable for analysis and decision-making._

#### 2. **Name three Python libraries that are used for data manipulation, besides Pandas.**

- _NumPy_
- _Polars_
- _PySpark_

#### 3. **What are the key limitations of standard Python lists when performing numerical computations with large datasets, which NumPy addresses?**

- _They are slower - Python lists require explicit loops to perform mathematical operations such as addition and multiplication on their elements, which is slow because Python is interpreted. NumPy, by contrast, uses optimized C-based implementations under the hood, allowing vectorized operations that are significantly faster._
- _They consume more memory - Python lists store pointers to objects, which leads to higher memory usage and slower access times, whereas NumPy arrays store elements in a contiguous block of memory using fixed-size data types, which is more memory-efficient and supports faster access._
- _They have limited support for mathematical operations - Lists don't support element-wise arithmetic natively. For example, `list1 + list2` concatenates rather than adds elements, whereas adding two NumPy arrays sums them element by element._
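The memory point above can be checked with a rough sketch. The variable names and the size `n` are illustrative, and the list figure is only an approximation that assumes CPython (small integers are cached, so the true overhead varies).

```python
import sys
import numpy as np

n = 1_000  # illustrative size, not taken from the worksheet
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list holds pointers; every int is a separate Python object on top of that.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An ndarray holds the raw values contiguously: n * itemsize bytes.
array_bytes = np_array.nbytes

print(f"list: ~{list_bytes} bytes, ndarray: {array_bytes} bytes")
```

The exact list size depends on the Python version, but the array buffer is always `n * 8` bytes for `int64`.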

#### 4. **Explain the core difference between how Python lists and NumPy arrays handle the addition of two lists/arrays.**

*The core difference between Python lists and NumPy arrays when handling addition lies in their fundamental behavior. When you use the `+` operator with Python lists, it performs concatenation, combining two lists into one longer list.  
For example, adding `[1, 2, 3]` and `[4, 5, 6]` yields `[1, 2, 3, 4, 5, 6]`.*

In [None]:
list_one = [1, 2, 3]
list_two = [4, 5, 6]
print(list_one + list_two)

[1, 2, 3, 4, 5, 6]


 *To perform element-wise addition with lists, you must manually iterate through them using loops or list comprehensions.*



In [None]:
result = []

for i in range(len(list_one)):
    result.append(list_one[i] + list_two[i])

print(result)

[5, 7, 9]
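The same element-wise result can be written as a list comprehension, the other option mentioned above. This is a minimal sketch that redefines the two lists so it runs on its own.

```python
list_one = [1, 2, 3]
list_two = [4, 5, 6]

# zip pairs up corresponding elements; the comprehension sums each pair.
result = [a + b for a, b in zip(list_one, list_two)]

print(result)
```

Note that `zip` stops at the shorter list, so lists of unequal length are truncated silently rather than raising an error.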


*In contrast, NumPy arrays treat the `+` operator as a mathematical operation, performing element-wise addition by default. Adding two NumPy arrays like `[1, 2, 3]` and `[4, 5, 6]` produces `[5, 7, 9]`, where each corresponding element is summed. This behavior stems from NumPy's design for numerical computing, where vectorized operations eliminate the need for explicit loops and significantly improve performance.*

In [None]:
import numpy as np
ndarray_one = np.array(list_one)
ndarray_two = np.array(list_two)

print(ndarray_one + ndarray_two)

[5 7 9]


#### 5. **What is an ndarray?**

_**ndarray**, short for N-dimensional array, is the fundamental data structure in NumPy that represents a fixed-size, homogeneous, multi-dimensional array of elements._

_It is NumPy’s powerful, efficient, and flexible container for numerical data, enabling fast computations and advanced math operations._


_To create an ndarray, we use the `array()` function from NumPy as shown:_


In [None]:
example_ndarray = np.array([1,2,3,4,5])

print(example_ndarray)

#Confirm the type of the ndarray
print(type(example_ndarray))

[1 2 3 4 5]
<class 'numpy.ndarray'>
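The "homogeneous, multi-dimensional" part of the definition can be illustrated through an array's attributes. The arrays `matrix` and `mixed` below are made-up examples, not part of the worksheet.

```python
import numpy as np

# A 2-D array built from a nested list: 2 rows, 3 columns.
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.ndim)   # number of dimensions
print(matrix.shape)  # (rows, columns)
print(matrix.dtype)  # one dtype shared by every element (platform-dependent integer type)

# Mixing ints and floats still yields a single dtype: the ints are upcast to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```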


#### 6. **List the advantages of using NumPy arrays over Python lists for numerical computations.**

- _They consume less memory - They use fixed-size, homogeneous data types stored in a contiguous block of memory, reducing overhead and memory usage._
- _They are faster than Python lists - Their operations are implemented in C and vectorized, making them much faster than looping through Python lists for arithmetic or mathematical computations._
- _They are convenient to use - They allow element-wise operations without explicit loops, making the code cleaner and easier to read._

#### 7. **What are the two primary data structures that Pandas introduces for working with tabular data? Briefly describe each.**

- _**Series** - A one-dimensional labeled array capable of holding any data type._
- _**DataFrame** - A two-dimensional labeled data structure consisting of rows and columns, where each column is a Series._
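A small sketch of both structures follows. The names, ages, and cities are invented purely for illustration.

```python
import pandas as pd

labels = ["Ann", "Ben", "Cara"]  # hypothetical row labels

# A Series: one-dimensional values with an index of labels.
ages = pd.Series([25, 32, 47], index=labels, name="Age")

# A DataFrame: several columns sharing the same index.
people = pd.DataFrame(
    {"Age": [25, 32, 47], "City": ["Nairobi", "Lagos", "Accra"]},
    index=labels,
)

print(ages["Ben"])           # look up a value by label
print(type(people["Age"]))   # a single column is itself a Series
```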

#### 8. **How does Pandas simplify working with tabular data compared to what NumPy primarily offers?**

It simplifies by:
- _Providing labeled indexing, allowing rows and columns to be accessed using names instead of only integer positions_
- _Supporting heterogeneous data types, enabling each column in a DataFrame to hold a different type, unlike NumPy arrays, which require a uniform type._
- _Displaying data in a clear, tabular format, which improves readability and helps in quickly understanding the structure and content._
- _Handling missing data gracefully, using NaN placeholders and offering built-in functions like `isnull()`, `fillna()`, and `dropna()` for data cleaning._
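The points in this list can be sketched on a tiny, made-up DataFrame. The column names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame(
    {
        "item": ["Bowl", "Burrito", "Chips"],          # text column
        "quantity": [2, 1, 1],                          # integer column
        "note": ["extra salsa", np.nan, np.nan],        # column with missing values
    },
    index=["a", "b", "c"],
)

# Labeled indexing: select by row label and column name.
print(orders.loc["b", "item"])

# Heterogeneous types: each column keeps its own dtype.
print(orders.dtypes)

# Missing data: fill the gaps, or drop rows that contain them.
filled = orders.fillna({"note": "none"})
dropped = orders.dropna()
print(filled)
print(dropped)
```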

#### 9. **Name two ways to access a specific column in a Pandas DataFrame. Provide a simple example for each.**

- _**Bracket notation** - Uses square brackets with the column name as a key, e.g. `df['Age']`._
- _**Dot notation** - Accesses the column as an attribute using a dot, e.g. `df.Age`._
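Both notations return the same Series, as this sketch on a hypothetical two-column frame shows. Dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, so bracket notation is the safer default.

```python
import pandas as pd

# Hypothetical data; "First Name" contains a space on purpose.
df_example = pd.DataFrame({"Age": [25, 32], "First Name": ["Ann", "Ben"]})

by_bracket = df_example["Age"]
by_dot = df_example.Age
print(by_bracket.equals(by_dot))

# A column name with a space can only be reached with brackets.
first_names = df_example["First Name"]
print(first_names)
```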

#### 10. **What does `df.isnull().sum()` tell you about a DataFrame in Pandas, and why is this function important in data cleaning?**

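_`df.isnull().sum()` returns the number of null (missing) values in each column of the DataFrame. This is important in data cleaning because it shows which columns have gaps, and how many, before you decide whether to fill or drop them._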
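As a minimal sketch, here is how the call behaves on a three-row frame. The rows are illustrative and only loosely modeled on the dataset used in Section 2.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "item_name": ["Chips", "Izze", "Chicken Bowl"],
        "choice_description": [np.nan, "[Clementine]", np.nan],
    }
)

# isnull() marks each cell True/False; sum() counts the True values per column.
null_counts = sample.isnull().sum()
print(null_counts)
```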

## Section 2: Foundational Data Exploration & Manipulation

---



In [None]:
import pandas as pd

#Loading the dataset
df = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv", sep='\t')

#### 1. **Display the last 7 rows of the DataFrame df to get an initial look at the data.**

In [None]:
df.tail(7)

|      | order_id | quantity | item_name | choice_description | item_price |
|------|----------|----------|-----------|--------------------|------------|
| 4615 | 1832 | 1 | Chicken Soft Tacos | [Fresh Tomato Salsa, [Rice, Cheese, Sour Cream]] | $8.75 |
| 4616 | 1832 | 1 | Chips and Guacamole | NaN | $4.45 |
| 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | $11.75 |
| 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | $11.75 |
| 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $11.25 |
| 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | $8.75 |
| 4621 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 |


#### 2. **Print a summary of the DataFrame including the index dtype and column dtypes, non-null values, and memory usage. Comment on what you see.**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


*The summary of the DataFrame shows that:*
- _The dataset has a total of 4622 records._
- _It has 5 columns: order_id, quantity, item_name, choice_description, and item_price._
- _No column has null values except choice_description, which has 1246 null values (4622 - 3376)._
- _The first two columns, order_id and quantity, are of integer type (int64), while the other three are of object type, which here means strings._
- _The memory usage of the dataset is at least 180.7 KB; the `+` indicates that the contents of the object columns are not fully counted._
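The reported memory figure is a lower bound when a frame has object columns, because only the pointers are counted by default. A sketch on a small hypothetical frame shows how `memory_usage(deep=True)` measures the string contents as well.

```python
import pandas as pd

# Hypothetical rows; not loaded from the real dataset.
sample = pd.DataFrame(
    {
        "order_id": [1, 1, 2],
        "item_name": ["Chips", "Izze", "Chicken Bowl"],
    }
)

shallow = sample.memory_usage().sum()          # default: string contents not inspected
deep = sample.memory_usage(deep=True).sum()    # also measures the stored strings

print(shallow, deep)
```

On the full dataset, `df.info(memory_usage="deep")` would report the deeper figure directly.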

#### 3. **Determine how many observations have missing item_price in the dataset.**

In [None]:
df.item_price.isnull().sum()

np.int64(0)

_No observations in the dataset have a missing item_price._

#### 4. **Print the names of all columns in the DataFrame.**

In [None]:
df.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')


#### 5. **Identify which item_name appears most frequently in the orders.**

In [None]:
#Number of times each item_name appears
item_name_count = df.item_name.value_counts()
print(item_name_count)

#value_counts() sorts in descending order, so the first entry is the most frequent
print(item_name_count.index[0], item_name_count.iloc[0])

item_name
Chicken Bowl                             726
Chicken Burrito                          553
Chips and Guacamole                      479
Steak Burrito                            368
Canned Soft Drink                        301
Chips                                    211
Steak Bowl                               211
Bottled Water                            162
Chicken Soft Tacos                       115
Chips and Fresh Tomato Salsa             110
Chicken Salad Bowl                       110
Canned Soda                              104
Side of Chips                            101
Veggie Burrito                            95
Barbacoa Burrito                          91
Veggie Bowl                               85
Carnitas Bowl                             68
Barbacoa Bowl                             66
Carnitas Burrito                          59
Steak Soft Tacos                          55
6 Pack Soft Drink                         54
Chips and Tomatillo Red Chili Salsa       48




_The most frequent item_name is Chicken Bowl, which appears 726 times._
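One way to pull out the top item and its count without relying on position is `idxmax()` together with `max()`. This is a sketch on a made-up Series, not the real order data.

```python
import pandas as pd

# Hypothetical item names for illustration.
items = pd.Series(["Chicken Bowl", "Chips", "Chicken Bowl", "Izze"], name="item_name")

counts = items.value_counts()

top_item = counts.idxmax()   # label with the highest count
top_count = counts.max()     # the count itself

print(top_item, top_count)
```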

#### 6. **Determine how many distinct item_name values are present in the dataset.**

In [None]:
df.item_name.nunique()

50

#### 7. **Create a new DataFrame containing only the rows where the item_name is 'Chicken Bowl'.**

In [None]:
new_dataframe = df[df.item_name == "Chicken Bowl"]
print(new_dataframe)

      order_id  quantity     item_name  \
4            2         2  Chicken Bowl   
5            3         1  Chicken Bowl   
13           7         1  Chicken Bowl   
19          10         1  Chicken Bowl   
26          13         1  Chicken Bowl   
...        ...       ...           ...   
4590      1825         1  Chicken Bowl   
4591      1825         1  Chicken Bowl   
4595      1826         1  Chicken Bowl   
4599      1827         1  Chicken Bowl   
4604      1828         1  Chicken Bowl   

                                     choice_description item_price  
4     [Tomatillo-Red Chili Salsa (Hot), [Black Beans...    $16.98   
5     [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...    $10.98   
13    [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...    $11.25   
19    [Tomatillo Red Chili Salsa, [Fajita Vegetables...     $8.75   
26    [Roasted Chili Corn Salsa (Medium), [Pinto Bea...     $8.49   
...                                                 ...        ...  
4590  [Roast
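The filter used in this cell is a boolean mask: the comparison produces one True/False value per row, and indexing with it keeps only the True rows. The sketch below uses a made-up three-row frame; calling `.copy()` is an optional extra that makes the result independent of the original.

```python
import pandas as pd

# Hypothetical rows for illustration.
orders = pd.DataFrame(
    {
        "item_name": ["Chicken Bowl", "Chips", "Chicken Bowl"],
        "quantity": [2, 1, 1],
    }
)

mask = orders["item_name"] == "Chicken Bowl"   # one boolean per row
chicken_bowls = orders[mask].copy()            # keep matching rows, original index preserved

print(mask.tolist())
print(chicken_bowls)
```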

#### 8. **The 'item_price' column is currently a string (e.g., '$2.39'). Convert it to a float type. You'll need to remove the dollar sign and then convert the data type.**

In [None]:
#Remove the $ sign prefixed in the item_price
df.item_price = df.item_price.str.strip('$')

#Convert the item_price into float
df.item_price = df.item_price.astype(float)


print(df.item_price.dtype)
df.head()

float64


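|   | order_id | quantity | item_name | choice_description | item_price |
|---|----------|----------|-----------|--------------------|------------|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | 2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 |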
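An alternative to `str.strip('$')`, sketched here on a hypothetical Series, is `str.replace` with `regex=False`, which treats `$` as a literal character rather than a regular-expression anchor. Both approaches give the same result for prices in the `$2.39` form.

```python
import pandas as pd

# Hypothetical price strings in the same format as the dataset.
prices = pd.Series(["$2.39", "$16.98", "$8.75"], name="item_price")

# Remove the literal dollar sign, then convert the strings to floats.
cleaned = prices.str.replace("$", "", regex=False).astype(float)

print(cleaned.dtype)
print(cleaned.tolist())
```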


#### 9. **After converting 'item_price' to a numerical type, calculate the average item_price across all orders.**

In [None]:
print(df.item_price.mean())

7.464335785374297


#### 10. **Determine the total number of unique orders in the dataset based on the order_id column.**

In [None]:
#Determining the number of unique values in the order_id column
df.order_id.nunique()

1834

## Section 3: Optional; Extra Credit

---



#### 1. **Total Revenue Calculation (Pandas)**

Assuming each `item_price` (after conversion to float) is for a single quantity of that item, calculate the total revenue generated from all orders. Remember that the `quantity` column indicates how many of that `item_name` were ordered in that specific line item.

In [None]:
sum(df.item_price * df.quantity)

39237.020000000055
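The calculation multiplies price by quantity row by row and then adds up the line totals. A sketch on three made-up rows makes the arithmetic easy to verify by hand; it uses the Series `.sum()` method, which is the more idiomatic pandas spelling of the built-in `sum()` used above.

```python
import pandas as pd

# Hypothetical line items, under the worksheet's assumption that item_price is per unit.
orders = pd.DataFrame(
    {
        "quantity": [1, 2, 1],
        "item_price": [2.39, 8.49, 3.39],
    }
)

line_totals = orders["quantity"] * orders["item_price"]  # element-wise product
total_revenue = line_totals.sum()                         # 2.39 + 16.98 + 3.39

print(round(total_revenue, 2))
```

The trailing digits in the notebook output come from floating-point rounding; rounding to two decimals gives a cleaner currency figure.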