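I imported the libraries below because the rest of the notebook depends on them: pandas and NumPy for loading and handling the data, statistics for basic statistical functions, and Matplotlib and seaborn for plotting.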

In [2]:
import pandas as pd
import numpy as np
import statistics as stats
import matplotlib.pyplot as plt
import seaborn as sns

I used pd.read_csv() to load the dataset into a DataFrame and df.head() to display its first five rows.

In [3]:
df = pd.read_csv('All_Diets.csv')
df.head()

Unnamed: 0,Diet_type,Recipe_name,Cuisine_type,Protein(g),Carbs(g),Fat(g),Extraction_day,Extraction_time
0,paleo,Bone Broth From 'Nom Nom Paleo',american,5.22,1.29,3.2,2022-10-16,17:20:09
1,paleo,"Paleo Effect Asian-Glazed Pork Sides, A Sweet ...",south east asian,181.55,28.62,146.14,2022-10-16,17:20:09
2,paleo,Paleo Pumpkin Pie,american,30.91,302.59,96.76,2022-10-16,17:20:09
3,paleo,Strawberry Guacamole recipes,mexican,9.62,75.78,59.89,2022-10-16,17:20:09
4,paleo,"Asian Cauliflower Fried ""Rice"" From 'Nom Nom P...",chinese,39.84,54.08,71.55,2022-10-16,17:20:09


I used the df.info() method to see the data type of each column in this dataset.
The result shows that the first three columns (Diet_type, Recipe_name, and Cuisine_type) have the object dtype, which here means they hold strings, that is, text or categorical data. The columns Protein(g), Carbs(g), and Fat(g) are float64, meaning they are stored as 64-bit floating-point numbers, which offer higher precision than 32-bit floats. The last two columns, Extraction_day and Extraction_time, are also reported as object: pandas read the dates and times as plain strings rather than converting them to a date or time type.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7806 entries, 0 to 7805
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Diet_type        7806 non-null   object 
 1   Recipe_name      7806 non-null   object 
 2   Cuisine_type     7806 non-null   object 
 3   Protein(g)       7806 non-null   float64
 4   Carbs(g)         7806 non-null   float64
 5   Fat(g)           7806 non-null   float64
 6   Extraction_day   7806 non-null   object 
 7   Extraction_time  7806 non-null   object 
dtypes: float64(3), object(5)
memory usage: 488.0+ KB
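
Because Extraction_day and Extraction_time are read as strings, they would need to be converted before any date arithmetic. The following is a minimal sketch, not part of the original analysis, that assumes both columns follow the `YYYY-MM-DD` and `HH:MM:SS` patterns seen in the first rows. The `sample` frame and the `extracted_at` name are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the format shown by df.head()
sample = pd.DataFrame({
    "Extraction_day": ["2022-10-16", "2022-10-16"],
    "Extraction_time": ["17:20:09", "17:20:09"],
})

# Join the two text columns and parse them into a single datetime column
extracted_at = pd.to_datetime(
    sample["Extraction_day"] + " " + sample["Extraction_time"],
    format="%Y-%m-%d %H:%M:%S",
)

print(extracted_at.dtype)    # a datetime64 dtype instead of text
print(extracted_at.dt.year)  # date parts are now accessible
```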


### Cleaning data

Next, I used df.isnull().sum() to count the missing (null or NaN) values in each column of the dataset. The result shows that there are no missing values.

In [5]:
df.isnull().sum()

Diet_type          0
Recipe_name        0
Cuisine_type       0
Protein(g)         0
Carbs(g)           0
Fat(g)             0
Extraction_day     0
Extraction_time    0
dtype: int64

The variable miss_val_formats holds a list of strings that might have been used to mark missing or invalid data in the dataset. The code below reads the file again and passes this list to the na_values parameter, so that any entry matching one of these strings is replaced with NaN (not a number).
The result still shows 7806 non-null entries in every column, which confirms that the dataset contains no missing data written in any of these formats.

In [29]:
miss_val_formats = ["n.a.", "?", "NA", "n/a", "na", "--"]
df = pd.read_csv("All_Diets.csv", na_values=miss_val_formats)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7806 entries, 0 to 7805
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Diet_type        7806 non-null   object 
 1   Recipe_name      7806 non-null   object 
 2   Cuisine_type     7806 non-null   object 
 3   Protein(g)       7806 non-null   float64
 4   Carbs(g)         7806 non-null   float64
 5   Fat(g)           7806 non-null   float64
 6   Extraction_day   7806 non-null   object 
 7   Extraction_time  7806 non-null   object 
dtypes: float64(3), object(5)
memory usage: 488.0+ KB
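
To see what na_values does when placeholders are present, here is a small sketch on made-up data. The CSV text below is hypothetical and is not taken from All_Diets.csv; it only illustrates how "?" and "--" are turned into NaN.

```python
import io
import pandas as pd

# Hypothetical CSV text with two placeholder entries ("?" and "--")
csv_text = "Recipe_name,Protein(g)\nA,5.22\nB,?\nC,--\nD,9.62\n"
miss_val_formats = ["n.a.", "?", "NA", "n/a", "na", "--"]

# Read once with the defaults and once with the extra missing-value markers
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=miss_val_formats)

print(raw["Protein(g)"].isnull().sum())      # placeholders are kept as text
print(cleaned["Protein(g)"].isnull().sum())  # placeholders are counted as missing
print(cleaned["Protein(g)"].dtype)           # the column can now be numeric
```

Without na_values, the placeholders also force the whole column to be read as text, which is why checking for them before computing summaries is worthwhile.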


### Numerical Summaries

Numerical summaries describe characteristics of the data, such as its distribution, using the mean, median, mode, quartiles, variance, standard deviation, and minimum and maximum values.

### Mean:
The mean is the average of a set of numbers. It is calculated by adding up all the values and dividing by the number of values (Nicholas, 1999, p.1).

I used the .describe() method to produce a numerical summary of the numerical columns.
In this table, the mean of each of the three columns is well above its median (the 50% row): about 83.2 versus 56.3 for protein, 152.1 versus 93.4 for carbs, and 117.3 versus 84.9 for fat. A mean that much larger than the median indicates right-skewed data, so these columns do not appear to follow a normal distribution.

In [32]:
df.describe()

Unnamed: 0,Protein(g),Carbs(g),Fat(g)
count,7806.0,7806.0,7806.0
mean,83.231498,152.123189,117.328542
std,89.797282,185.907322,122.098117
min,0.0,0.06,0.0
25%,24.415,36.1625,41.0675
50%,56.28,93.415,84.865
75%,112.3575,205.915,158.29
max,1273.61,3405.55,1930.24
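
The comparison between mean and median can be made explicit with a few lines of code. This is only a rough sketch: it copies the mean and median values from the table above into a hypothetical `summary` dictionary and computes their ratio. A ratio well above 1 is a hint of right skew, not a formal test.

```python
# Mean and median (50% row) copied from the df.describe() output
summary = {
    "Protein(g)": (83.231498, 56.28),
    "Carbs(g)": (152.123189, 93.415),
    "Fat(g)": (117.328542, 84.865),
}

ratios = {}
for column, (mean, median) in summary.items():
    # A ratio above 1 means the mean is pulled upward by large values
    ratios[column] = mean / median
    print(f"{column}: mean is {ratios[column]:.2f} times the median")
```

With the full DataFrame loaded, `df.skew(numeric_only=True)` would be another way to quantify the same asymmetry.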


### Mean Function: 
The first part of this function follows the same pattern as any other function: it is defined with the *def* keyword, followed by the name of the function (calc_mean) and, in parentheses, a parameter (variable) that receives the column to be summarised.

The second part checks whether the length of the variable is 0. If so, the data is empty and the function returns None. Otherwise, it proceeds to the next part, the mean calculation.

To calculate the mean, the sum() of the variable is divided by the number of elements in it, which is given by len(). Then, the function returns the calculated mean.

In [46]:
def calc_mean(variable):
    if len(variable) == 0:
        return None
    mean = sum(variable) / len(variable)
    return mean

In [47]:
calc_mean(df['Protein(g)'])

83.23149756597437
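
As a quick sanity check, the function can be compared with the standard library on a small input. The sketch below repeats calc_mean so that it runs on its own and uses the five protein values shown by df.head(); the `protein_sample` name is introduced only for this example.

```python
import math
import statistics as stats

def calc_mean(variable):
    # Return None for empty input, otherwise the arithmetic mean
    if len(variable) == 0:
        return None
    return sum(variable) / len(variable)

# Protein values from the five rows shown by df.head()
protein_sample = [5.22, 181.55, 30.91, 9.62, 39.84]

print(calc_mean(protein_sample))
print(calc_mean([]))  # None: an empty input has no mean

# The hand-written function should agree with statistics.mean
print(math.isclose(calc_mean(protein_sample), stats.mean(protein_sample)))
```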

### Median Function:
The median is the middle number. It is calculated by putting the numbers in order and taking the actual middle number if there is one, or the average of the two middle numbers if not (Nicholas, 1999, p.1).

The first part sorts the variable in ascending order using the sorted() function. Then, the variable 'n' is assigned the length of the sorted list, that is, the number of elements in the list.

The if part of the code checks whether 'n' is an even or odd number. If 'n' is divisible by two, there are two middle elements, and in this case the mean of these two elements is the median. However, if 'n' is not divisible by two, there is a single middle element, which is the median.

Then, the function will return the median value for a specific variable.

In [72]:
def calc_median(variable):
    sorted_variable = sorted(variable)
    n = len(sorted_variable)

    if n % 2 == 0:
        mid1 = sorted_variable[n // 2 - 1]
        mid2 = sorted_variable[n // 2]
        median = (mid1 + mid2) / 2
    else:
        median = sorted_variable[n // 2]
    return median

In [87]:
calc_median(df['Protein(g)'])

56.28
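
The two branches of the function are easiest to see on tiny inputs. The sketch below repeats calc_median so that it runs on its own and tries one odd-length and one even-length list; the example lists are made up for illustration.

```python
import statistics as stats

def calc_median(variable):
    sorted_variable = sorted(variable)
    n = len(sorted_variable)
    if n % 2 == 0:
        # Even length: average the two middle elements
        return (sorted_variable[n // 2 - 1] + sorted_variable[n // 2]) / 2
    # Odd length: take the single middle element
    return sorted_variable[n // 2]

odd_result = calc_median([3, 1, 2])      # sorted: [1, 2, 3]
even_result = calc_median([4, 1, 3, 2])  # sorted: [1, 2, 3, 4]

print(odd_result)   # 2
print(even_result)  # 2.5
print(even_result == stats.median([4, 1, 3, 2]))  # agrees with the standard library
```

Note that, unlike calc_mean, this function has no guard for empty input, so an empty list raises an IndexError.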

### Mode Function:
The mode is defined as the most commonly occurring number (Nicholas, 1999, p.1).

The first part of this function checks whether the length of the variable is 0. If so, the data is empty and the function returns None. Otherwise, it proceeds to the next part.

Secondly, a dictionary 'num_freq' stores the frequency of each unique number in the variable.

The *for* loop checks whether each number is already a key in the dictionary ('num_freq'). If it is, the count stored under that key is increased by one; if not, a new key is created for the number with an initial count of one.

The next step is to find the highest frequency in the variable. For this, the max() function is applied to the dictionary's values and the result is stored in 'max_count'.

Then, a list comprehension builds the 'mode' list by checking whether the frequency of each number matches 'max_count'; if so, the number is added to the list.

The last part checks whether the length of the 'mode' list is equal to the length of the variable. If they are equal, every value occurs exactly once, so no value appears more frequently than the others and there is no mode.

In [75]:
def calc_mode(variable):
    if len(variable) == 0:
        return None
    num_freq = {}
   
    for num in variable: 
        if num in num_freq:
            num_freq[num] += 1
        else:  
            num_freq[num] = 1
    
    max_count = max(num_freq.values())
    mode = [num for num, count in num_freq.items() if count == max_count]
    
    if len(mode) == len(variable):
        return "No mode found"
    return mode

In [88]:
calc_mode(df['Protein(g)'])

[0.0]
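
The behaviour on ties and on inputs without repeats is worth checking with small lists. The sketch below repeats calc_mode so that it runs on its own; the example lists are made up for illustration.

```python
import statistics as stats

def calc_mode(variable):
    if len(variable) == 0:
        return None
    # Count how often each distinct value occurs
    num_freq = {}
    for num in variable:
        if num in num_freq:
            num_freq[num] += 1
        else:
            num_freq[num] = 1
    max_count = max(num_freq.values())
    mode = [num for num, count in num_freq.items() if count == max_count]
    # Every value occurs exactly once: report that there is no mode
    if len(mode) == len(variable):
        return "No mode found"
    return mode

print(calc_mode([1, 2, 2, 3, 3]))  # [2, 3]: two values tie for the top count
print(calc_mode([1, 2, 3]))        # No mode found
print(calc_mode([1, 1, 2, 2]))     # [1, 2]: tied values that repeat are all returned
print(calc_mode([1, 2, 2, 3, 3]) == stats.multimode([1, 2, 2, 3, 3]))
```

The function returns a list because a dataset can have more than one mode. The "No mode found" message only appears when no value repeats at all.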

### References

Nicholas, J. (1999). *Introduction to Descriptive Statistics*. Sydney: Mathematics Learning Centre, University of Sydney.