PROJECT GOALS:
1. Browse the dataset and inspect its shape, features, and data types
2. Formulate valid questions that can be answered with the available data

Question to analyze:
Build a model that predicts a user's zodiac sign from the other features in the data

Steps:
1. Clean up data
2. Select features
3. Select model
4. Select hyperparameters


In [1]:
import pandas as pd
import numpy as np

In [2]:
data_file = pd.read_csv('profiles.csv')
print('''###############################################
DataFrame info:''')
data_file.info()  # info() prints its summary directly and returns None, so print() would add a stray "None"
print('''###############################################
DataFrame sample:''')
#data_file.head()
features_of_interest = ['body_type', 'job','diet', 'drinks', 'drugs', 'education', 'orientation','sex', 'smokes', 'status', 'religion', 'ethnicity']

# Remove unwanted features (free-text essays and last_online)
data_file = data_file.drop(columns=['essay0','essay1','essay2','essay3','essay4','essay5','essay6','essay7','essay8','essay9','last_online'])
for column in data_file.columns:
    uniques = data_file[column].nunique()
    print(f'{column} uniques: {uniques}')
# data_file.head()


###############################################
DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   body_type    54650 non-null  object 
 2   diet         35551 non-null  object 
 3   drinks       56961 non-null  object 
 4   drugs        45866 non-null  object 
 5   education    53318 non-null  object 
 6   essay0       54458 non-null  object 
 7   essay1       52374 non-null  object 
 8   essay2       50308 non-null  object 
 9   essay3       48470 non-null  object 
 10  essay4       49409 non-null  object 
 11  essay5       49096 non-null  object 
 12  essay6       46175 non-null  object 
 13  essay7       47495 non-null  object 
 14  essay8       40721 non-null  object 
 15  essay9       47343 non-null  object 
 16  ethnicity    54266 non-null  object 
 17  height       59943 non-n

Possible predictions:
1. Predict the zodiac sign
2. Predict sex
3. Predict religion
4. Predict drug use
5. Predict your ideal partner's age based on other criteria
6. Predict your ideal partner's religion based on other criteria

## DATA CLEANUP

Action items:
1. Handle NaNs
2. Check for outliers
3. Create bins/categories
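
The outlier and binning items above can be sketched as follows. This is a minimal illustration, not the approach the notebook settles on: the sample values are made up, the 1.5 x IQR rule is one common outlier heuristic among several, and the helper `iqr_outlier_mask` and the age bin edges are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for profiles.csv (values are illustrative only).
sample = pd.DataFrame({
    "age": [22, 35, 29, 41, 110, 31],
    "height": [65.0, 70.0, np.nan, 68.0, 72.0, 1.0],
})

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]; NaNs are never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

age_outliers = iqr_outlier_mask(sample["age"])
height_outliers = iqr_outlier_mask(sample["height"])
print(sample[age_outliers | height_outliers])

# Hypothetical age bins: the edges and labels are assumptions, not taken from the data.
age_bins = [17, 25, 35, 45, 120]
age_labels = ["18-25", "26-35", "36-45", "46+"]
sample["age_group"] = pd.cut(sample["age"], bins=age_bins, labels=age_labels)
print(sample["age_group"].value_counts())
```

Flagged rows are only candidates for review; whether to drop, cap, or keep them depends on what the distribution checks show.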

Data Cleanup:



age          59946 non-null  int64   - check data distribution
body_type    54650 non-null  object  - create categories, check distribution
diet         35551 non-null  object  - bin data, check distribution
drinks       56961 non-null  object  - create categories, check distribution
drugs        45866 non-null  object  - create categories, check distribution
education    53318 non-null  object  - bin data, check distribution
ethnicity    54266 non-null  object  - ? too many combinations
height       59943 non-null  float64 - check data distribution
income       59946 non-null  int64   - check data distribution
job          51748 non-null  object  - check data distribution
last_online  59946 non-null  object  - disregard
location     59946 non-null  object  - bin data, check distribution
offspring    24385 non-null  object  - check data distribution, probably disregard due to high NaN count
orientation  59946 non-null  object  - check data distribution
pets         40025 non-null  object  - check data distribution
religion     39720 non-null  object  - check data distribution
sex          59946 non-null  object  - check data distribution
sign         48890 non-null  object  - check data distribution
smokes       54434 non-null  object  - create categories, check distribution
speaks       59896 non-null  object  - bin data, check distribution
status       59946 non-null  object  - check data distribution
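
One way to carry out the "check distribution" notes for a categorical column is to look at category shares with NaN counted as its own row. The sketch below assumes made-up sample values, and the helper name `category_shares` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample values; the real columns come from profiles.csv.
sample = pd.DataFrame({
    "drinks": ["socially", "often", "socially", np.nan, "rarely", "socially"],
    "smokes": ["no", "no", np.nan, "sometimes", "no", "no"],
})

def category_shares(frame, column):
    """Percentage share of each category in `column`, with NaN kept as its own row."""
    return frame[column].value_counts(normalize=True, dropna=False).mul(100).round(2)

drinks_shares = category_shares(sample, "drinks")
print(drinks_shares)
```

For numeric columns such as `age`, `height`, and `income`, `describe()` gives a comparable quick summary.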

In [10]:
for i in data_file.columns:
    nans = np.round(data_file[i].isna().sum() / len(data_file) * 100, 2)
    print(f'{i} nans: {nans}')

age nans: 0.0
body_type nans: 8.83
diet nans: 40.69
drinks nans: 4.98
drugs nans: 23.49
education nans: 11.06
ethnicity nans: 9.48
height nans: 0.01
income nans: 0.0
job nans: 13.68
location nans: 0.0
offspring nans: 59.32
orientation nans: 0.0
pets nans: 33.23
religion nans: 33.74
sex nans: 0.0
sign nans: 18.44
smokes nans: 9.19
speaks nans: 0.08
status nans: 0.0
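
The same percentages can be computed without an explicit loop, which also makes it easy to filter columns by NaN share. This is a sketch on a made-up sample frame; the 50% threshold is an assumption chosen for illustration, not a rule taken from this analysis.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for data_file.
sample = pd.DataFrame({
    "age": [22, 35, 29, 41],
    "diet": ["anything", np.nan, np.nan, "vegetarian"],
    "offspring": [np.nan, np.nan, np.nan, "has kids"],
})

# isna() gives a boolean frame; the column-wise mean of booleans is the NaN fraction.
nan_pct = sample.isna().mean().mul(100).round(2)
print(nan_pct)

# Hypothetical cutoff for "too many NaNs to be useful".
threshold = 50.0
high_nan_columns = nan_pct[nan_pct > threshold].index.tolist()
print(high_nan_columns)
```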


In [None]:
# data_file.dropna(subset=['height', 'education'],  # only consider these two columns
#                  inplace=True,  # modify data_file in place instead of returning a copy
#                  how='any')     # drop a row if either field is missing
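
The same `dropna(subset=...)` pattern is a natural fit for the target column: a row with no `sign` cannot serve as a labeled example for a zodiac-sign classifier. The sketch below uses made-up values and returns a new frame instead of using `inplace=True`; whether to drop or impute the other columns is a separate decision.

```python
import numpy as np
import pandas as pd

# Made-up sample; the real sign values in profiles.csv may be formatted differently.
sample = pd.DataFrame({
    "height": [65.0, np.nan, 70.0, 68.0],
    "education": ["college", "masters", np.nan, "college"],
    "sign": ["leo", "virgo", "aries", np.nan],
})

# Keep only rows that have a target label; other columns may still contain NaNs.
labeled = sample.dropna(subset=["sign"])
print(len(sample), "->", len(labeled))
```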

In [17]:
# Inspecting features with NaNs and the planned fills:
# body_type - turn NaNs into a new category
# diet - fill NaNs with "mostly anything"
# drinks - fill NaNs with "socially"
print(data_file.body_type.unique())

['a little extra' 'average' 'thin' 'athletic' 'fit' nan 'skinny' 'curvy'
 'full figured' 'jacked' 'rather not say' 'used up' 'overweight']
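
The fills planned in the comments above could be applied in one call with a per-column dictionary. This is a sketch on made-up rows; the `'unknown'` label for the new `body_type` category is a hypothetical choice (the existing `'rather not say'` value would be another option).

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for data_file.
sample = pd.DataFrame({
    "body_type": ["average", np.nan, "fit"],
    "diet": [np.nan, "mostly anything", "vegetarian"],
    "drinks": ["socially", np.nan, "often"],
})

fill_values = {
    "body_type": "unknown",      # hypothetical name for the new NaN category
    "diet": "mostly anything",   # fill value named in the notes above
    "drinks": "socially",        # fill value named in the notes above
}

# fillna with a dict fills each listed column with its own value and returns a new frame.
filled = sample.fillna(value=fill_values)
print(filled)
```

Filling with an existing category changes that category's share, so it is worth re-checking the distribution afterwards.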
