# Introduction

Dating apps have become a widely used way for people to connect and build relationships. They generate rich, diverse datasets that make it possible to explore patterns in human behavior, preferences, and interactions. Analyzing data from these platforms can provide insight into social dynamics and modern relationship trends.

The aim of this project is to scope a data science problem, prepare and analyze the data, and develop a machine learning model that answers a specific research question.


**Data sources:**

`profiles.csv` was provided by Codecademy.com.

## Scoping

Project scoping is a useful first step: it gives the project a structure and forces you to think through the entire project before you begin.
Following the [Data Science Project Scoping Guide](https://www.datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/), this section sets the high-level goals of the project, determines which actions could be taken or improved, identifies the data that is needed and where it can be gathered, and finally describes the planned analysis and the techniques that could be applied.

### Goals

The goal of this project is to apply the skills gained from Codecademy to a dataset using machine learning techniques. The main research question is whether an OkCupid user's astrological sign can be predicted from other profile variables. This matters because many users consider astrological signs important when matching, so if a user does not provide a sign, OkCupid could benefit from being able to predict it.

### Actions

Because this project is for a fictional customer, there are no actions to scope.

### Data

The project utilizes a dataset provided by Codecademy, named `profiles.csv`. Each row in the dataset represents an OkCupid user, with columns containing their responses to various profile questions, including multiple-choice and short answer questions.

### Analysis

This solution will employ descriptive statistics and data visualization to identify key figures that reveal the distribution, count, and relationships among variables. To achieve the project's goal of predicting users' astrological signs, classification algorithms from the supervised learning category of machine learning models will be utilized.

The project will end with an evaluation of the chosen machine learning model using a validation dataset. The results will be assessed through a confusion matrix, along with metrics including accuracy, precision, recall, F1 score, and Kappa score.
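
The evaluation step described above can be sketched on a tiny, made-up set of predictions. This is a minimal sketch, not the project's actual evaluation: it assumes scikit-learn is used, that "Kappa score" means Cohen's kappa, and that precision, recall, and F1 are macro-averaged across signs. The labels in `y_true` and `y_pred` are hypothetical.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted signs for six validation users.
y_true = ["aries", "aries", "leo", "leo", "virgo", "virgo"]
y_pred = ["aries", "leo", "leo", "leo", "virgo", "aries"]
labels = ["aries", "leo", "virgo"]

# Rows are true signs, columns are predicted signs, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

accuracy = accuracy_score(y_true, y_pred)

# Macro averaging (an assumption) weights every sign equally.
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Cohen's kappa corrects the observed agreement for chance agreement.
kappa = cohen_kappa_score(y_true, y_pred)

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} kappa={kappa:.3f}")
```

With twelve roughly balanced signs, a model that guesses at random would be right about one time in twelve, which is why a chance-corrected measure such as kappa is a useful companion to raw accuracy here.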

# Load data

In [14]:
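# Numerical computing and tabular data handling
import numpy as np
import pandas as pd

# Plotting libraries
from matplotlib import pyplot as plt
import seaborn as sns

# Default figure size (in inches) and inline rendering in the notebook
plt.rcParams['figure.figsize'] = [6, 6]
%matplotlib inline

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')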

First we import the necessary Python libraries. Pandas is used to work with the CSV dataset, and Matplotlib and Seaborn are used to explore and visualize the findings.

In [16]:
profiles = pd.read_csv('profiles.csv', encoding='utf-8')
profiles.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single
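
The preview above shows that the text fields contain HTML entities such as `&rsquo;` and `<br />` tags, and that `sign` can carry a qualifier after the sign itself (for example `pisces but it doesn&rsquo;t matter`). The following is one possible cleaning sketch, not a step taken from the original notebook. The helper `clean_text` and the column `sign_base` are hypothetical names, and the toy frame reuses a few values from the preview rather than loading `profiles.csv`.

```python
import html
import re

import pandas as pd

# Toy values adapted from the preview above; not the full dataset.
sample = pd.DataFrame({
    "offspring": ["doesn&rsquo;t have kids, but might want them", None],
    "sign": ["pisces but it doesn&rsquo;t matter", "gemini"],
    "essay0": ["about me:<br />\n<br />\ni would love to think", None],
})


def clean_text(value):
    """Decode HTML entities and drop <br /> tags; leave missing values as-is."""
    if pd.isna(value):
        return value
    text = html.unescape(value)             # "&rsquo;" becomes a real apostrophe
    text = re.sub(r"<br\s*/?>", " ", text)  # replace line-break tags with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


# Apply the cleaner to every cell of every (text) column.
cleaned = sample.apply(lambda col: col.map(clean_text))

# Keep only the first word of `sign` as the sign itself (hypothetical column).
cleaned["sign_base"] = cleaned["sign"].str.split().str[0]

print(cleaned[["sign", "sign_base"]])
```

Reducing `sign` to its first word would leave twelve classes to predict instead of one class per sign-and-qualifier combination. Whether the qualifier should be discarded is a modeling decision this sketch does not settle.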


In [22]:
print(profiles.shape)

(59946, 31)


`profiles` has 59,946 rows and 31 columns.

In [25]:
print(profiles.columns)

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')


The columns in the dataset include: 

- **age:** continuous variable of age of user
- **body_type:** categorical variable of body type of user
- **diet:** categorical variable of dietary information
- **drinks:** categorical variable of alcohol consumption
- **drugs:** categorical variable of drug usage
- **education:** categorical variable of educational attainment
- **ethnicity:** categorical variable of ethnic backgrounds
- **height:** continuous variable of height of user
- **income:** continuous variable of income of user
- **job:** categorical variable of employment description
- **offspring:** categorical variable of children status
- **orientation:** categorical variable of sexual orientation
- **pets:** categorical variable of pet preferences
- **religion:** categorical variable of religious background
- **sex:** categorical variable of gender
- **sign:** categorical variable of astrological sign
- **smokes:** categorical variable of smoking habits
- **speaks:** categorical variable of languages spoken
- **status:** categorical variable of relationship status
- **last_online:** date variable of last login
- **location:** categorical variable of user locations
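
The split between continuous and categorical variables listed above can be checked programmatically. The sketch below is only an illustration on a hypothetical three-row frame (`toy`), with made-up `height` and `income` values. On the real data the same two `select_dtypes` calls could be applied to `profiles`, on the assumption that pandas parses `age`, `height`, and `income` as numbers.

```python
import pandas as pd

# Hypothetical miniature of the dataset; height and income values are made up.
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "height": [75.0, 70.0, 68.0],
    "income": [20000, 80000, 50000],
    "body_type": ["a little extra", "average", "thin"],
    "sign": ["gemini", "cancer", "pisces"],
})

# Columns stored as numbers are candidates for the continuous variables.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (text columns here) is treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)

# Summary statistics for the numeric columns, category counts for the rest.
print(toy[numeric_cols].describe())
print(toy[categorical_cols].nunique())
```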

The dataset also includes a set of open short-answer responses to the following prompts:

- **essay0:** My self summary
- **essay1:** What I’m doing with my life
- **essay2:** I’m really good at
- **essay3:** The first thing people usually notice about me
- **essay4:** Favorite books, movies, shows, music, and food
- **essay5:** The six things I could never do without
- **essay6:** I spend a lot of time thinking about
- **essay7:** On a typical Friday night I am
- **essay8:** The most private thing I am willing to admit
- **essay9:** You should message me if…
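
Because the essay columns share the `essay` prefix, they can be selected as a group, for instance to see how many prompts each user answered. This is a minimal sketch on a hypothetical two-row frame with shortened placeholder answers. The names `essay_cols`, `answered`, and `missing_share` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Hypothetical two-user sample with shortened placeholder answers.
toy = pd.DataFrame({
    "age": [23, 29],
    "essay0": ["i work in a library and go to school", "hey how's it going?"],
    "essay1": ["reading things written by old dead people", "work work work + play"],
    "essay3": ["socially awkward but i do my best", None],
})

# Select every column whose name starts with "essay".
essay_cols = [col for col in toy.columns if col.startswith("essay")]

# Number of essay prompts each user answered (non-missing cells per row).
answered = toy[essay_cols].notna().sum(axis=1)

# Share of users who left each prompt blank (missing cells per column).
missing_share = toy[essay_cols].isna().mean()

print(essay_cols)
print(answered.tolist())
print(missing_share)
```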