# Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [127]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

This file is a .tsv (tab-separated values) file. We can load it into a Jupyter Notebook with the pandas function pd.read_csv() by setting the delimiter parameter to a tab (delimiter='\t'). The sep parameter is an equivalent alias.

### Step 3. Assign it to a variable called chipo.

In [128]:
chipo = pd.read_csv(r"https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv", delimiter='\t')
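
To try the same call without a network connection, here is a minimal sketch that reads a small in-memory sample instead of the URL. The names `sample_tsv` and `sample` and the three rows are illustrative assumptions, laid out like the first rows shown in Step 4.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same tab-separated layout as the dataset
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tChips and Fresh Tomato Salsa\t\t$2.39 \n"
    "1\t1\tIzze\t[Clementine]\t$3.39 \n"
    "1\t1\tNantucket Nectar\t[Apple]\t$3.39 \n"
)

# sep="\t" tells read_csv that fields are separated by tabs, not commas
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample.shape)
print(sample.columns.tolist())
```

The empty choice_description field in the first row is read as NaN, which is what shows up later as a missing value in .info().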

### Step 4. See the first 10 entries

In [129]:
chipo.head(10)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


### Step 5. What is the number of observations in the dataset?

I use .info() which provides a concise summary of the DataFrame, including the number of non-null entries for each column.
- It also displays the total number of rows (entries) in the DataFrame.
- This method is straightforward for quickly checking the number of rows.

In [130]:
# Solution 1
chipo.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


I use .describe(), which provides summary statistics for numerical columns by default (int64 and float64 types).
Its count row reports the number of non-null values in each numeric column, which matches the number of rows only when a column has no missing values. The rest of the output is statistical information like mean, min, max, etc.

In [131]:
# Solution 2
chipo.describe()


Unnamed: 0,order_id,quantity
count,4622.0,4622.0
mean,927.254868,1.075725
std,528.890796,0.410186
min,1.0,1.0
25%,477.25,1.0
50%,926.0,1.0
75%,1393.0,1.0
max,1834.0,15.0
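
If only the row count is needed, len() and .shape[0] return it directly. A minimal sketch on a hypothetical three-row stand-in for chipo (the name `df` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for chipo, with one missing value
df = pd.DataFrame({
    "quantity": [1, 1, 2],
    "choice_description": ["[Clementine]", None, "[Apple]"],
})

n_rows = len(df)                              # number of rows
n_rows_alt = df.shape[0]                      # first element of (rows, columns)
non_null = df["choice_description"].count()   # counts non-null values only

print(n_rows, n_rows_alt, non_null)
```

The last value is smaller than the row count because count() skips missing entries, which is why the count in describe() or info() is not always the number of observations.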


### Step 6. What is the number of columns in the dataset?

Using the method info() I can see that the number of columns is 5

![image.png](attachment:image.png)

I wonder if there is another way to see the number :

### The .columns attribute :

Method 1: Using .columns attribute
We can directly use the .columns attribute of the DataFrame to get a list of column names, and then find the length of this list to determine the number of columns.

In [132]:
number_of_columns = len(chipo.columns)

In [133]:
number_of_columns

5

### The .shape attribute:

Method 2: Using shape attribute
The shape attribute of a pandas DataFrame returns a tuple with the dimensions of the DataFrame: (rows, columns). The first element (index 0) is the number of rows and the second element (index 1) is the number of columns, so chipo.shape[1] gives the number of columns directly.

In [134]:
# Use the .shape attribute to get the number of columns (second element of the tuple)
num_columns = chipo.shape[1]

In [135]:
num_columns

5

![image.png](attachment:image.png)

### Step 7. Print the name of all the columns.

In [136]:
column_names = chipo.columns

### 1. Printing column_names as an Index object:

In [137]:
print(column_names)

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')


 ---> The output will display the column names encapsulated within an Index object

### 2. Iterating through column_names and printing each column name:

In [138]:
for column in column_names:
    print(column)

order_id
quantity
item_name
choice_description
item_price


---> In this case, each column name is printed on its own line because the for loop iterates over each element of column_names.

### Step 8. How is the dataset indexed?

There are several ways to see how the dataset is indexed.

### 1. Viewing the Index Information:



In [139]:
print(chipo.index)

RangeIndex(start=0, stop=4622, step=1)


This output indicates the default RangeIndex: integer labels from 0 up to, but not including, 4622, so the last label is 4621.
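
A quick way to check the bounds of a RangeIndex is to look at its first and last labels. This sketch builds the same index by hand rather than loading the data (the name `idx` is illustrative):

```python
import pandas as pd

# Same parameters as the RangeIndex printed above
idx = pd.RangeIndex(start=0, stop=4622, step=1)

first_label = idx[0]
last_label = idx[-1]   # stop is exclusive, so this is stop - 1

print(first_label, last_label, len(idx))
```

This matches the "4622 entries, 0 to 4621" line that .info() reports.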

### 2. Using .info() Method:

In [140]:
# info() prints its summary itself and returns None, so it is called without print()
chipo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


### Step 9. Which was the most-ordered item? 

The most-ordered item can be found from the quantity column. But how do I find which item was ordered the most? Calling max() on quantity alone would only give the largest quantity in a single row, not the total per item.

In [141]:
chipo

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
...,...,...,...,...,...
4617,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",$11.75
4618,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",$11.75
4619,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$11.25
4620,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",$8.75


We need the columns "item_name" and "quantity".

Step 1 ---> Group by "item_name", then sum the quantity for each item to see which one was ordered the most.

In [142]:
item_quantities  = chipo.groupby('item_name')['quantity'].sum()

In [143]:
item_quantities

item_name
6 Pack Soft Drink                         55
Barbacoa Bowl                             66
Barbacoa Burrito                          91
Barbacoa Crispy Tacos                     12
Barbacoa Salad Bowl                       10
Barbacoa Soft Tacos                       25
Bottled Water                            211
Bowl                                       4
Burrito                                    6
Canned Soda                              126
Canned Soft Drink                        351
Carnitas Bowl                             71
Carnitas Burrito                          60
Carnitas Crispy Tacos                      8
Carnitas Salad                             1
Carnitas Salad Bowl                        6
Carnitas Soft Tacos                       40
Chicken Bowl                             761
Chicken Burrito                          591
Chicken Crispy Tacos                      50
Chicken Salad                              9
Chicken Salad Bowl                       123


Now, we can find the most ordered item :

In [144]:
most_ordered_item = item_quantities.idxmax()
print("The most ordered item is :", most_ordered_item)


The most ordered item is : Chicken Bowl


### Step 10. For the most-ordered item, how many items were ordered?

In [145]:
max_quantity = item_quantities.max()
print("The quantity of the most ordered item is: ", max_quantity)

The quantity of the most ordered item is:  761
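
Steps 9 and 10 can also be answered in one pass by sorting the grouped totals. A minimal sketch on a hypothetical four-row sample (the names `sample`, `totals`, `top_item` and `top_quantity` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample with the same two columns used above
sample = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Bowl", "Steak Burrito"],
    "quantity": [2, 1, 1, 1],
})

# Total quantity per item, largest first
totals = sample.groupby("item_name")["quantity"].sum().sort_values(ascending=False)

top_item = totals.index[0]      # name of the most-ordered item
top_quantity = totals.iloc[0]   # how many of it were ordered

print(top_item, top_quantity)
```

Sorting also makes it easy to look at the runners-up with totals.head().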


### Step 11. What was the most ordered item in the choice_description column?

In [146]:
chipo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


In the info() output, choice_description has only 3376 non-null values out of 4622 rows, which means NaN values are present. I need to remove these NaN values before processing the choice_description column.

### Data Cleaning and Preparation

1. Handling NaN Values in choice_description
Let's clean up the choice_description column by removing rows where it is NaN.

In [147]:
# Drop rows where "choice_description" is NaN

chipo_choice_column_cleaned = chipo.dropna(subset=["choice_description"])

#### 2. After dropping the NaN values with the dropna() method, we reset the index so that it starts from zero again

In [148]:
chipo_choice_column_cleaned.reset_index(drop=True, inplace=True)

#### 3. Display info to check the cleaned DataFrame:

In [149]:
print("Cleaned DataFrame after the clean-up:")
chipo_choice_column_cleaned

Cleaned DataFrame after the clean-up:


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Izze,[Clementine],$3.39
1,1,1,Nantucket Nectar,[Apple],$3.39
2,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
3,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
4,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
...,...,...,...,...,...
3371,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",$11.75
3372,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",$11.75
3373,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$11.25
3374,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",$8.75


The clean-up seems to be successful: the number of rows dropped from 4622 to 3376, meaning that the rows with NaN values were excluded.

After cleaning up NaN values from the choice_description column, we proceed with further processing:

#### 4. Convert Values in choice_description to Strings and Strip Unwanted Characters

- String Conversion ---> The values in this column look like lists, but pandas reads them from the file as plain strings. Calling .astype(str) makes sure every value is a string, so string operations can be applied consistently.
- Stripping Characters ---> In the dataset, the values in "choice_description" are wrapped in [], so we remove the leading and trailing brackets with str.strip("[]"). Brackets in the middle of the text are left untouched.

In [150]:
chipo_choice_column_cleaned["choice_description"] = chipo_choice_column_cleaned["choice_description"].astype(str).str.strip("[]")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chipo_choice_column_cleaned["choice_description"] = chipo_choice_column_cleaned["choice_description"].astype(str).str.strip("[]")
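
The message above is pandas' SettingWithCopyWarning: pandas could not tell whether the assignment targets an independent DataFrame or a slice of another one. One commonly suggested way to avoid the ambiguity is to take an explicit .copy() before assigning. A minimal sketch on a hypothetical three-row sample (`sample` and `cleaned` are illustrative names, not the notebook's variables):

```python
import pandas as pd

# Hypothetical stand-in for chipo
sample = pd.DataFrame({
    "item_name": ["Chips", "Izze", "Nantucket Nectar"],
    "choice_description": [None, "[Clementine]", "[Apple]"],
})

# .copy() makes the result an independent DataFrame, so assigning to one of
# its columns cannot be mistaken for writing into a slice of `sample`
cleaned = (
    sample.dropna(subset=["choice_description"])
    .reset_index(drop=True)
    .copy()
)
cleaned["choice_description"] = cleaned["choice_description"].str.strip("[]")

print(cleaned["choice_description"].tolist())
```

The original `sample` keeps its bracketed values, since only the copy was modified.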


#### 5. Next, we remove empty strings that might have resulted from stripping:

### Why do we do this?

Data Integrity: During the cleaning process, some values might end up as empty strings ("") if they were originally just brackets with no content ([]). By filtering out these empty strings, we ensure that our dataset only contains meaningful data that can be further analyzed.

In [151]:
# Remove any empty strings that might have resulted from stripping
chipo_choice_column_cleaned = chipo_choice_column_cleaned[chipo_choice_column_cleaned['choice_description'] != ""]


#### Find the most ordered item :

In [152]:
most_ordered_choice = chipo_choice_column_cleaned["choice_description"].value_counts().idxmax()
most_ordered_count = chipo_choice_column_cleaned['choice_description'].value_counts().max()

In [153]:
# Print the most ordered item
print(f"The most ordered item in the choice_description column is '{most_ordered_choice}' with a total of {most_ordered_count} orders.")

The most ordered item in the choice_description column is 'Diet Coke' with a total of 134 orders.
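
Note that value_counts() counts rows, so a row with quantity 3 counts once. If the question is read as "which choice was ordered in the largest total quantity", grouping and summing quantity is an alternative. A minimal sketch on hypothetical data showing that the two readings can disagree (all names and values here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample: Diet Coke appears on more rows, Sprite has more units
sample = pd.DataFrame({
    "choice_description": ["[Diet Coke]", "[Sprite]", "[Diet Coke]", None, "[Diet Coke]"],
    "quantity": [1, 5, 1, 1, 1],
})

# Reading 1: number of order lines per choice
row_counts = sample["choice_description"].value_counts()

# Reading 2: total units per choice (groupby skips NaN keys by default)
quantity_totals = sample.groupby("choice_description")["quantity"].sum()

most_rows = row_counts.idxmax()
most_units = quantity_totals.idxmax()

print(most_rows, most_units)
```

Which reading is appropriate depends on how "most ordered" is meant in the exercise.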


### Step 12. How many items were ordered in total?

Maybe we should convert the "item_name" column into a list first:

In [154]:
quantity_to_list = chipo_choice_column_cleaned["item_name"].to_list()

In [155]:
quantity_to_list 

['Izze',
 'Nantucket Nectar',
 'Chicken Bowl',
 'Chicken Bowl',
 'Steak Burrito',
 'Steak Soft Tacos',
 'Steak Burrito',
 'Chicken Crispy Tacos',
 'Chicken Soft Tacos',
 'Chicken Bowl',
 'Chicken Burrito',
 'Chicken Burrito',
 'Canned Soda',
 'Chicken Bowl',
 'Barbacoa Burrito',
 'Nantucket Nectar',
 'Chicken Burrito',
 'Izze',
 'Chicken Bowl',
 'Carnitas Burrito',
 'Canned Soda',
 'Chicken Burrito',
 'Steak Burrito',
 'Carnitas Bowl',
 'Chicken Soft Tacos',
 'Chicken Soft Tacos',
 'Barbacoa Bowl',
 'Chicken Bowl',
 'Steak Burrito',
 'Chicken Salad Bowl',
 'Chicken Burrito',
 'Steak Burrito',
 'Izze',
 'Steak Burrito',
 'Steak Burrito',
 'Canned Soda',
 'Chicken Burrito',
 'Canned Soda',
 'Steak Bowl',
 'Barbacoa Soft Tacos',
 'Veggie Burrito',
 'Barbacoa Bowl',
 'Steak Soft Tacos',
 'Veggie Bowl',
 'Chicken Burrito',
 'Steak Burrito',
 'Steak Soft Tacos',
 'Izze',
 'Steak Burrito',
 'Chicken Burrito',
 'Steak Burrito',
 'Steak Burrito',
 'Chicken Burrito',
 'Chicken Soft Tacos',
 'Chi

We have the list of all items from the "item_name" column, so now we can find the length of that list:

In [156]:
number_of_items = len(quantity_to_list)

In [157]:
print("Number of items ordered is: ", number_of_items)

Number of items ordered is:  3376


This shows the number of rows (order lines), not the total number of items ordered. To find the total, I need to sum the quantities of all items. Before doing so, I need to check that all values in the column are numeric (check the data type of the column) and, if necessary, convert it to a numeric type.

1. Check the datatype of the "quantity" column.

In [158]:
print(chipo_choice_column_cleaned['quantity'].dtype)

int64
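
Since the column is already int64, the total can be obtained with .sum(). A minimal sketch on a hypothetical three-row sample (names and values are assumptions). On the real data, summing over the full chipo DataFrame rather than the cleaned one would also include items that have no choice_description.

```python
import pandas as pd

# Hypothetical sample: three order lines, one of them with quantity 2
sample = pd.DataFrame({
    "item_name": ["Chips and Fresh Tomato Salsa", "Izze", "Chicken Bowl"],
    "choice_description": [None, "[Clementine]", "[Tomatillo-Red Chili Salsa (Hot)]"],
    "quantity": [1, 1, 2],
})

n_rows = len(sample)                     # number of order lines
total_items = sample["quantity"].sum()   # number of items ordered

print(n_rows, total_items)
```

The two numbers differ as soon as any row has a quantity greater than 1.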


### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [159]:
chipo.item_price.dtype

dtype('O')

#### Step 13.b. Create a lambda function and change the type of item price

Need to remember how to create a lambda function...

So... A lambda function is an anonymous function that can take multiple arguments but contains only one expression, and looks like this:
![image.png](attachment:image.png) and we use the apply() method to apply the lambda function to the item_price column:



In [160]:
chipo['item_price'] = chipo['item_price'].apply(lambda x: float(x))

ValueError: could not convert string to float: '$2.39 '

In [None]:
print("DataFrame after converting 'item_price' to float:")
print(chipo)

DataFrame after converting 'item_price' to float:
      order_id  quantity                              item_name  \
0            1         1           Chips and Fresh Tomato Salsa   
1            1         1                                   Izze   
2            1         1                       Nantucket Nectar   
3            1         1  Chips and Tomatillo-Green Chili Salsa   
4            2         2                           Chicken Bowl   
...        ...       ...                                    ...   
4617      1833         1                          Steak Burrito   
4618      1833         1                          Steak Burrito   
4619      1834         1                     Chicken Salad Bowl   
4620      1834         1                     Chicken Salad Bowl   
4621      1834         1                     Chicken Salad Bowl   

                                     choice_description item_price  
0                                                   NaN     $2.39   
1      

After applying the lambda function I get an error message: ![image.png](attachment:image.png)

![image.png](attachment:image.png)

##### The error message indicates that there are values with the $ sign in the column, so they cannot be converted directly to float. We need to remove the dollar sign and any leading/trailing spaces before converting these values to float:
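
The cleanup can also be folded into the lambda itself. A minimal sketch, assuming a few sample price strings shaped like the one in the error message (the names `prices`, `to_float` and `prices_float` are illustrative):

```python
import pandas as pd

# Hypothetical sample values with a dollar sign and a trailing space
prices = pd.Series(["$2.39 ", "$3.39 ", "$16.98 "])

# Strip whitespace, drop the dollar sign, then convert to float.
# Python's built-in str.replace always treats "$" as a literal character.
to_float = lambda x: float(x.strip().replace("$", ""))

prices_float = prices.apply(to_float)

print(prices_float.dtype)
```

On the real data the same lambda could be passed to chipo['item_price'].apply(...), assuming every value follows this pattern.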

#### Detailed Steps :
- Convert the column to string type, so that all values are strings:

In [None]:
chipo['item_price'] = chipo['item_price'].astype(str)

- Strip any leading/trailing whitespace from the string values:

In [None]:
chipo['item_price'] = chipo['item_price'].str.strip()

- Remove the dollar sign from the strings. Passing regex=False makes pandas treat "$" as a literal character rather than as the regex end-of-string anchor:

In [None]:
chipo['item_price'] = chipo['item_price'].str.replace("$", "", regex=False)

## Debugging Result

![image.png](attachment:image.png)

In [None]:
print("\nAfter converting to string, stripping whitespace, and removing dollar sign:")
print(chipo)


After converting to string, stripping whitespace, and removing dollar sign:
      order_id  quantity                              item_name  \
0            1         1           Chips and Fresh Tomato Salsa   
1            1         1                                   Izze   
2            1         1                       Nantucket Nectar   
3            1         1  Chips and Tomatillo-Green Chili Salsa   
4            2         2                           Chicken Bowl   
...        ...       ...                                    ...   
4617      1833         1                          Steak Burrito   
4618      1833         1                          Steak Burrito   
4619      1834         1                     Chicken Salad Bowl   
4620      1834         1                     Chicken Salad Bowl   
4621      1834         1                     Chicken Salad Bowl   

                                     choice_description item_price  
0                                                

#### Step 13.c. Check the item price type

In [None]:
chipo.item_price.dtype

dtype('O')

This indicates that the data type is still object, which is how pandas stores strings (or a mix of different Python types). I have already removed the dollar sign and any leading/trailing whitespace from the strings, but cleaning the text does not change the dtype: the values remain strings such as '2.39' until they are explicitly converted, for example with astype(float) or pd.to_numeric. If unexpected non-numeric characters (like commas) were still present in item_price, that conversion would fail or, with errors='coerce', produce NaN.
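
To see that string cleanup alone leaves the column non-numeric, the dtype can be checked before and after an explicit conversion. A minimal sketch on two hypothetical values (the names `prices`, `cleaned`, `still_text` and `as_float` are assumptions for illustration):

```python
import pandas as pd

prices = pd.Series(["$2.39 ", "$3.39 "])

# String cleanup only: the values are still text afterwards
cleaned = prices.str.strip().str.replace("$", "", regex=False)
still_text = not pd.api.types.is_numeric_dtype(cleaned)

# The dtype changes only with an explicit conversion
as_float = cleaned.astype(float)

print(cleaned.dtype, as_float.dtype)
```

Here `still_text` is True even though the strings now look like numbers.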

#### Troubleshooting steps


1. Ensure Data Cleanliness

In [None]:
# Check data types
print(chipo.dtypes)

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object


2. Identify Missing Values: Check for missing values (NaN, None, empty strings).



In [None]:
print(chipo.isnull().sum())


order_id                 0
quantity                 0
item_name                0
choice_description    1246
item_price               0
dtype: int64


3. Make sure all is numeric:

In [162]:
# Remove the dollar sign (regex=False treats "$" literally), then convert 'item_price' to float

chipo['item_price'] = pd.to_numeric(chipo['item_price'].str.replace('$', '', regex=False), errors='coerce')

pd.to_numeric(chipo['item_price'], errors='coerce'): This function converts the 'item_price' column to numeric values.

#### Handling Errors:
errors='coerce': Any value that still cannot be parsed as a number after cleaning becomes NaN (Not a Number) instead of raising an error. You can then decide how to handle these NaN values based on your analysis needs (e.g., dropping rows with NaN values or filling them with a default value).
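
Because 'coerce' hides failures, it can help to count how many values became NaN. A minimal sketch with one deliberately unparseable value (the names `raw`, `converted` and `n_failed` and the sample values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical cleaned strings, one of which is not a number
raw = pd.Series(["2.39", "3.39", "free"])

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# Number of values that could not be converted
n_failed = converted.isna().sum()

print(converted.tolist(), n_failed)
```

A non-zero count is a signal to inspect the original strings before dropping or filling anything.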

Check the type after conversion : 

In [163]:
# Check the dtype after conversion
print(chipo['item_price'].dtype)

float64


### Step 14. How much was the revenue for the period in the dataset?

In [169]:
revenue = (chipo['quantity']* chipo['item_price']).sum()

print('Revenue was: $' + str(np.round(revenue,2)))

Revenue was: $39237.02


### Step 15. How many orders were made in the period?
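
One possible approach, sketched on hypothetical data: each order can span several rows, so the number of orders is the number of distinct order_id values rather than the number of rows (the names `sample` and `n_orders` are illustrative).

```python
import pandas as pd

# Hypothetical sample: six order lines belonging to three orders
sample = pd.DataFrame({"order_id": [1, 1, 1, 2, 3, 3]})

# Count distinct order IDs
n_orders = sample["order_id"].nunique()

print(n_orders)
```

On the real data the equivalent call would be chipo['order_id'].nunique().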

### Step 16. What is the average revenue amount per order?
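
A possible sketch, assuming item_price has already been converted to float and using the same revenue definition as Step 14 (quantity times item_price). The sample rows and all variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: four order lines belonging to three orders
sample = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "quantity": [1, 1, 2, 1],
    "item_price": [2.39, 3.39, 16.98, 10.98],
})

# Revenue per row, using the Step 14 definition
line_revenue = sample["quantity"] * sample["item_price"]

# Sum the rows of each order, then average across orders
revenue_per_order = line_revenue.groupby(sample["order_id"]).sum()
average_order_revenue = revenue_per_order.mean()

print(round(average_order_revenue, 2))
```

Dividing the total revenue by the number of distinct orders gives the same figure, which is a useful cross-check.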

In [87]:
# Solution 1



In [88]:
# Solution 2



### Step 17. How many different items are sold?
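
One way to answer this, sketched on hypothetical data, is to count the distinct values in item_name (the names `sample` and `n_items` are illustrative):

```python
import pandas as pd

# Hypothetical sample: four order lines, three distinct items
sample = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Bowl", "Steak Burrito"],
})

# Number of distinct item names
n_items = sample["item_name"].nunique()

print(n_items)
```

On the real data the equivalent call would be chipo['item_name'].nunique(). Near-duplicate names (for example "Canned Soda" and "Canned Soft Drink" in the Step 9 output) are counted as separate items.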