# Exploratory Data Analysis for the Food Nutrients Data

## Description

Clean the fetched FDC datasets containing nutritional information about various products, and explore them.

## Table of Contents

## Results summary

## Imports

In [1]:
import os

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

## Data source

The data was obtained from [FoodData Central](https://fdc.nal.usda.gov/) (U.S. Department of Agriculture). The database contains nutritional information about various products, together with the number of analyzed samples, basic statistics, analytical methods and other details. Full details are available for "Foundation Foods" but not necessarily for other sections ("SR Legacy", for example). A more complete description of the database and its contents can be found in the [Foundation Foods documentation](https://fdc.nal.usda.gov/docs/Foundation_Foods_Documentation_Apr2023.pdf).

## Outline of data being explored

There are 10 datasets, stored as .csv files in the /FoodDataAnalysis/data/raw/ directory.

Each of them contains data for a particular kind of food (fish, vegetables, fruit, etc.). \
Products in each food category are characterized by up to 7 variables, where available:
- Energy (kcal)
- Protein (g)
- Fat (g)
- Carbohydrate (g)
- Potassium (mg)
- Calcium (mg)
- Magnesium (mg)
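
In the cell outputs further down, these seven variables appear under their longer FDC column names (for example `Total lipid (fat)` and `Carbohydrate, by difference`). A minimal sketch of mapping them to short labels with units is given here; the short names and the helper `shorten_columns` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Keys: column names as they appear in the notebook outputs.
# Values: hypothetical short labels chosen for this sketch.
SHORT_NAMES = {
    "Energy": "energy_kcal",
    "Protein": "protein_g",
    "Total lipid (fat)": "fat_g",
    "Carbohydrate, by difference": "carbohydrate_g",
    "Potassium, K": "potassium_mg",
    "Calcium, Ca": "calcium_mg",
    "Magnesium, Mg": "magnesium_mg",
}


def shorten_columns(df):
    """Return a copy of df with the known nutrient columns renamed.

    Columns that are absent are simply skipped, since not every
    variable is available in every dataset.
    """
    return df.rename(columns=SHORT_NAMES)


example = pd.DataFrame(
    {
        "Description": ["whole grain buckwheat"],
        "Energy": [356],
        "Protein": [11.1],
        "Total lipid (fat)": [3.04],
    }
)
renamed = shorten_columns(example)
print(list(renamed.columns))
```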

NOTE: Energy in the datasets corresponds to Atwater General Factors.
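
As a quick illustration of what this means, the sketch below recomputes energy from the macronutrients, assuming the commonly quoted general factors of 4 kcal/g for protein, 9 kcal/g for fat and 4 kcal/g for carbohydrate (alcohol is ignored). The function name `atwater_energy` is hypothetical; the input values are those of the buckwheat row shown in the grain output later in this notebook.

```python
# Atwater general factors in kcal per gram (assumed values: 4 / 9 / 4).
ATWATER_GENERAL = {"protein": 4.0, "fat": 9.0, "carbohydrate": 4.0}


def atwater_energy(protein_g, fat_g, carbohydrate_g):
    """Estimate energy in kcal from macronutrient masses in grams."""
    return (
        ATWATER_GENERAL["protein"] * protein_g
        + ATWATER_GENERAL["fat"] * fat_g
        + ATWATER_GENERAL["carbohydrate"] * carbohydrate_g
    )


# Buckwheat: protein 11.1 g, fat 3.04 g, carbohydrate 71.1 g
buckwheat_kcal = atwater_energy(11.1, 3.04, 71.1)
print(round(buckwheat_kcal))  # 356, matching the Energy column for that row
```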

## Analysis

### Read data

Note: The data was first corrected manually in Excel by removing inappropriate entries (frozen food, restaurant food, processed food, etc.). Descriptions were also shortened in Excel by deleting unnecessary details, because individual cells are easy to edit there.

In [101]:
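def load_data(dirpath):
    """Read every .csv file in dirpath into a dict of DataFrames keyed by file name."""
    dfs = {}
    for filename in os.listdir(dirpath):
        if filename.endswith('.csv'):
            # 'vegetable.csv' -> 'vegetable'
            df_name = filename.split('.')[0]
            dfs[df_name] = pd.read_csv(os.path.join(dirpath, filename))
    return dfs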

data = load_data('../data/raw/')
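
The `Unnamed: 0` column that shows up in the outputs further down suggests that the raw files were saved together with their row index; this is an assumption, since the files themselves are not shown here. If so, `pd.read_csv(..., index_col=0)` would read that column as the index instead. A small sketch on in-memory CSV text:

```python
import io

import pandas as pd

# Toy CSV text with an unnamed first column, as written by DataFrame.to_csv()
csv_text = (
    ",Description,Energy\n"
    "0,whole grain buckwheat,356\n"
    "1,whole grain millet,376\n"
)

# Default read: the unnamed first column becomes a regular column
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default.columns))  # ['Unnamed: 0', 'Description', 'Energy']
print(list(indexed.columns))  # ['Description', 'Energy']
```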

In [104]:
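vegetable = data['vegetable']
# Compute the mask once, on the unfiltered table, so that both subsets are non-overlapping
# and mushrooms are not removed before they can be selected
is_mushroom = vegetable['Description'].str.contains('[Mm]ushroom')
mushrooms = vegetable[is_mushroom]
vegetable = vegetable[~is_mushroom]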

fruit = data['fruit']
nut = data['nut']
cheese = data['cheese']
grain = data['grain']
fish = data['fish']
egg_and_milk = data['egg']

### Clean data

Vegetables

In [111]:
def clean_mushrooms_dataset(mushroom_df):
    df_copy = mushroom_df.copy()
    # reverse_names is defined in the next cell
    df_copy['Description'] = df_copy['Description'].apply(reverse_names)
    return df_copy

In [112]:
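def reverse_names(description):
    """Move the first comma-separated part to the end: 'A, b, c' -> 'b c A'."""
    # Strip each part so the result has no leading or doubled spaces
    list_of_words = [word.strip() for word in description.split(',')]
    return ' '.join(list_of_words[1:] + list_of_words[:1])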
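
To make the effect of this kind of reordering concrete, here is a self-contained sketch on made-up descriptions. The helper `rotate_description` is hypothetical; it follows the same split-and-rotate idea and strips whitespace around each part.

```python
def rotate_description(description):
    """Move the first comma-separated part of a description to the end."""
    parts = [part.strip() for part in description.split(",")]
    return " ".join(parts[1:] + parts[:1])


print(rotate_description("buckwheat, whole grain"))   # whole grain buckwheat
print(rotate_description("Mushrooms, white, raw"))    # white raw Mushrooms
print(rotate_description("millet"))                   # millet
```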

In [116]:
def clean_vegetable_dataset(vegetable_df):
    df_copy = vegetable_df.copy()

    # Choose only relevant products: keep raw vegetables
    # (.copy() avoids a SettingWithCopyWarning on the assignment below)
    df_copy = df_copy[df_copy['Description'].str.contains('raw')].copy()
    # Remove the ', raw' suffix from the descriptions
    df_copy['Description'] = df_copy['Description'].str.replace(r'\b, raw\b', '', regex=True)
    return df_copy

In [117]:
vegetables_cleaned = clean_vegetable_dataset(vegetable)
mushrooms_cleaned = clean_mushrooms_dataset(mushrooms)

In [119]:
mushrooms_cleaned

Unnamed: 0,Description,Energy,Protein,Total lipid (fat),"Carbohydrate, by difference","Potassium, K","Calcium, Ca","Magnesium, Mg"


Fruits

In [60]:
def clean_fruits_dataset(fruit_df):
    df_copy = fruit_df.copy()

    # Keep the figs row (position 12) explicitly, in addition to the rows matched by the 'raw' filter
    figs = df_copy.iloc[[12]]
    df_copy = df_copy[df_copy['Description'].str.contains('raw')]
    df_copy = pd.concat([df_copy, figs], ignore_index=True)
    df_copy['Description'] = df_copy['Description'].str.lower()
    # Raw string is required here: in a normal string, '\b' is a backspace, not a word boundary
    df_copy['Description'] = (
        df_copy['Description']
        .str.replace(',', '', regex=False)
        .str.replace(r'\braw\b', '', regex=True)
        .str.strip()
    )
    return df_copy
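
Two regex details are easy to miss in this kind of cleaning. In a normal Python string `'\b'` is a backspace character, so a word-boundary pattern has to be written as a raw string (`r'\b'`). Also, a plain substring search for `raw` matches words such as "strawberries". The sketch below shows both on made-up descriptions.

```python
import pandas as pd

# Made-up descriptions, for illustration only
descriptions = pd.Series(["apples raw", "strawberries frozen"])

# '\b' without the r prefix is a backspace character: the pattern never matches
unchanged = descriptions.str.replace('\braw\b', '', regex=True)

# r'\b' is a regex word boundary: the standalone word 'raw' is removed
cleaned = descriptions.str.replace(r'\braw\b', '', regex=True).str.strip()

# Substring search also hits 'st-raw-berries'; the word-boundary version does not
substring_hits = descriptions.str.contains('raw')
word_hits = descriptions.str.contains(r'\braw\b')

print(unchanged.tolist())       # ['apples raw', 'strawberries frozen']
print(cleaned.tolist())         # ['apples', 'strawberries frozen']
print(substring_hits.tolist())  # [True, True]
print(word_hits.tolist())       # [True, False]
```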

In [62]:
fruits_cleaned = clean_fruits_dataset(fruit)

Cheese

In [97]:
def clean_cheese_dataset(cheese_df):
    df_copy = cheese_df.copy()
    df_copy = df_copy[~df_copy['Description'].str.contains('restaurant')]
    df_copy['Description'] = df_copy['Description'].str.lower()
    # Drop the leading category word: 'cheese, brie' -> 'brie'
    df_copy['Description'] = df_copy['Description'].apply(
        lambda x: ' '.join(part.strip() for part in x.split(',')[1:])
    )
    # Processed cheese products; drop() uses index labels, not row positions
    df_copy = df_copy.drop([7, 17])
    return df_copy
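
Because `DataFrame.drop` works on index labels rather than row positions, hard-coded labels keep pointing at the same rows even after earlier rows were filtered out, but they raise a `KeyError` if a label is no longer present. A small sketch on made-up data:

```python
import pandas as pd

# Made-up descriptions, for illustration only
toy = pd.DataFrame(
    {
        "Description": [
            "cheese, brie",
            "cheese, restaurant, queso",
            "cheese, cheddar",
            "cheese product, processed",
        ]
    }
)

# Filtering keeps the original labels: 0, 2, 3
filtered = toy[~toy["Description"].str.contains("restaurant")]

# Label 3 still refers to the processed product, although it is now the third row
dropped = filtered.drop([3])

# Label 1 was filtered out already; errors="ignore" skips missing labels
safe = filtered.drop([1, 3], errors="ignore")

print(filtered.index.tolist())  # [0, 2, 3]
print(dropped["Description"].tolist())  # ['cheese, brie', 'cheese, cheddar']
```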

In [98]:
cheese_cleaned = clean_cheese_dataset(cheese)

Grain

In [99]:
def clean_grain_dataset(grain_df):
    df_copy = grain_df.copy()
    df_copy['Description'] = df_copy['Description'].str.lower().apply(reverse_names)
    return df_copy

In [100]:
grain_cleaned = clean_grain_dataset(grain)
grain_cleaned.head()

| Unnamed: 0 | Description | Energy | Protein | Total lipid (fat) | Carbohydrate, by difference | Potassium, K | Calcium, Ca | Magnesium, Mg |
|---|---|---|---|---|---|---|---|---|
| 0 | whole grain buckwheat | 356 | 11.1 | 3.04 | 71.1 | 414.0 | 13.6 | 203.0 |
| 1 | whole grain millet | 376 | 10.0 | 4.19 | 74.4 | 214.0 | 9.1 | 106.0 |
| 2 | oat whole grain flour | 389 | 13.2 | 6.31 | 69.9 | 373.0 | 42.8 | 125.0 |
| 3 | spelt whole grain flour | 364 | 14.5 | 2.54 | 70.7 | 350.0 | 30.0 | 124.0 |
| 4 | whole grain steel cut oats | 381 | 12.5 | 5.8 | 69.8 | 376.0 | 51.3 | 129.0 |


Bean

Nut

Fish

Egg

### EDA

## Conclusions