# Mini Project

### 1. What data do you have?

- We have data on the following questions from the survey:
    18. How many units are you taking currently?
    54. What is your current living situation?
    55. How do you commute to school?
    56. Do you have a UCR dining plan?
    57. How many times a week do you purchase food or drinks from somewhere on campus?
    58. What items do you purchase?
    59. What is the biggest reason you do not purchase more food and drinks on campus?

Each column consists of one of these questions, while each row consists of an individual’s response to each question. Essentially, the data we have relates to the student’s amount of units they are taking, their living situation and if they commute, whether or not they purchase food on campus, and if they don’t what is the reason why.

### 2. What would you like to know?

We would like to know if the amount of units and living situation a student has affects how they get food on campus. We are trying to determine if a student’s total units affect if they buy food often if they have at and/or 12 units as opposed to students who are part time who take less than 12 units. We are also trying to determine if their living situation affects whether or not students purchase food on campus.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as plt
import plotly.express as px
from scipy.stats import chi2_contingency

In [2]:
raw_data = pd.read_csv("CS105Fall2022.csv")
data = pd.DataFrame(raw_data[['18. How many units are you taking currently?',
               '54. What is your current living situation?',
               '55. How do you commute to school?',
               '56. Do you have a UCR dining plan?',
               '57. How many times a week do you purchase food or drinks from somewhere on campus?',
               '58. What items do you purchase?',
               '59. What is the biggest reason you do not purchase more food and drinks on campus?']])
data.rename(columns={'18. How many units are you taking currently?' : 'units',
               '54. What is your current living situation?': 'live situation',
               '55. How do you commute to school?' : 'commute',
               '56. Do you have a UCR dining plan?' : 'dining plan',
               '57. How many times a week do you purchase food or drinks from somewhere on campus?' : 'avg purchases',
               '58. What items do you purchase?' : 'items',
               '59. What is the biggest reason you do not purchase more food and drinks on campus?' : 'why nobuy'}, inplace=True)
data

Unnamed: 0,units,live situation,commute,dining plan,avg purchases,items,why nobuy
0,16.0,Off-Campus (3-mile radius),Walk,No,1,"Entrees, Water",Financial issues
1,12.0,Off-Campus (3-mile radius),Public Transportation,No,0,,Expensive compared to other options
2,16.0,Off-Campus (4-30 mile radius),Drive Yourself,No,0.5,Entrees,Not hungry
3,16.0,Off-Campus (3-mile radius),Public Transportation,No,5,"Entrees, Snacks, Desserts, Juice/Tea/Coffee",Not hungry
4,17.0,Off-Campus (4-30 mile radius),Drive Yourself,No,3,"Entrees, Snacks, Juice/Tea/Coffee",Not hungry
...,...,...,...,...,...,...,...
101,17.0,On Campus,Walk,Yes,1,"Snacks, Soda",Financial issues
102,16.0,Off-Campus (3-mile radius),Walk,No,4,Entrees,Prefer cooking/packing
103,12.0,Off-Campus (4-30 mile radius),Drive Yourself,No,1,Entrees,Lack of options
104,12.0,Off-Campus (4-30 mile radius),Drive Yourself,No,5,Entrees,Lack of options


#### Cleaning

In [3]:
data.loc[data['avg purchases'].isna()]

Unnamed: 0,units,live situation,commute,dining plan,avg purchases,items,why nobuy
31,12.0,Off-Campus (3-mile radius),Walk,No,,Snacks,No Time
36,,,,,,,
38,12.0,,,,,,
88,,,,,,,
91,15.0,Off-Campus (3-mile radius),,,,,


In [4]:
# Manually Drop and Fill nans
na_data = data.loc[data['avg purchases'].isna()]
data.drop(index=na_data.index.values[1:], inplace=True)
data['avg purchases'] = data['avg purchases'].fillna('0')
data['items'] = data['items'].fillna('')

In [5]:
# data.loc[data['avg purchases'].isna()]

In [6]:
# Manually Fixing Non-numeric Values
non_num_data = data.loc[data['avg purchases'].apply(str).str.isnumeric() == False]
display(non_num_data)
data.loc[2, 'avg purchases'] = 0.5  # str to float
data.drop(index=10, inplace=True)  # invalid data
data.loc[43, 'avg purchases'] = (2 * 7)  # 2x per day per week
data.loc[78, 'avg purchases'] = 1  # ~1 => 1

Unnamed: 0,units,live situation,commute,dining plan,avg purchases,items,why nobuy
2,16.0,Off-Campus (4-30 mile radius),Drive Yourself,No,0.5,Entrees,Not hungry
10,16.0,Off-Campus (4-30 mile radius),Drive Yourself,No,every day,"Snacks, Soda, Water",Lack of options
43,12.5,Off-Campus (4-30 mile radius),Public Transportation,Yes,twice a day,"Entrees, Juice/Tea/Coffee",Lack of options
78,16.5,Off-Campus (4-30 mile radius),Drive Yourself,No,~1,"Entrees, Snacks, Fruits/Salads, Water",low quality fast food


In [7]:
# data.loc[data['avg purchases'].apply(str).str.isnumeric() == False]

In [8]:
# Convert avg purchases to numerical
data['avg purchases'] = data['avg purchases'].astype('float64')
data.dtypes

units             float64
live situation     object
commute            object
dining plan        object
avg purchases     float64
items              object
why nobuy          object
dtype: object

In [9]:
# Remove non-valid (outlier) unit counts
non_valid_unit = data.loc[data['units'] > 28.0]
display(non_valid_unit)
data.drop(index=non_valid_unit.index.values, inplace=True)

Unnamed: 0,units,live situation,commute,dining plan,avg purchases,items,why nobuy
69,128.5,Off-Campus (4-30 mile radius),Drive Yourself,No,1.0,Snacks,No Time


In [10]:
data.loc[data['units'] > 28.0]

Unnamed: 0,units,live situation,commute,dining plan,avg purchases,items,why nobuy


In [11]:
# Filter and drop invalid responses:
    # if avg purchases != 0.0, but items = 0.0

### 3. Explore the data. (Generate statistics, perform visualizations) 

Explain what you are computing (mean, SD, ...), and then compute using Python.

#### Data Extraction

In [12]:
# splitting item purchased list
indv_items = pd.DataFrame(range(0, np.shape(data)[0]), columns=['indx'])
indv_items['items'] = pd.DataFrame(data['items'].apply(lambda x: x.split(', ')))

# convert list into appropriate columns and data values to True/False
indv_items = pd.pivot_table(indv_items.explode('items'), values='items', index='indx', columns='items', aggfunc=lambda x: True)
indv_items = indv_items.fillna(False)
indv_items = indv_items.rename(columns={'' : 'None'})
indv_items

items,None,Desserts,Entrees,Fruits/Salads,Juice/Tea/Coffee,Snacks,Soda,Water
indx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,False,False,True,False,False,False,False,True
1,True,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False,False
3,False,True,True,False,True,True,False,False
4,False,False,True,False,True,True,False,False
...,...,...,...,...,...,...,...,...
95,False,False,True,False,False,True,True,False
96,False,False,True,False,False,False,False,False
97,True,False,False,False,False,False,False,False
98,False,False,True,False,False,False,False,False


In [13]:
items_none = indv_items['None']
items_none

indx
0     False
1      True
2     False
3     False
4     False
      ...  
95    False
96    False
97     True
98    False
99    False
Name: None, Length: 94, dtype: bool