# Tinder Analysis

**TODO**:

    - Missing values 
    
    - Data preprocessing?
    
    - Statistics: https://tinderinsights.com/
    
    - Drop redundant messagesReceived and messagesSent 
    
    - education and educationLevel: what is the difference?
    
    
**Questions:**

    - How many people have finished college? Of the ones that have finished college, is there any sex bias?
    
    - What are the minimum, mean and maximum percentages of one-message conversations for each sex?
    
    - Which sex links an Instagram account more often?
    
    - Most used emojis by sex
    
    - Number of matches by sex
    
    - Number of matches by day of the week
    
    - 

# Problem description

Describe the problem, including any required background, and explain why you believe it is important / interesting.


No description of the data is given

# Data collection

Where was the data obtained?

# Mongo DB data load

Relational vs. NoSQL: which one makes more sense?

In [3]:
import json
import pymongo
from pymongo import MongoClient

client = MongoClient('localhost', 27017, username='mongoadmin', password='pass1234')

# Get the list of DBs already defined
print(client.list_database_names())

# Create a new DB - bda
database = client['bda']

# Create a collection
tinder = database.tinder

# Load the data
with open('tinder_dataset/profiles_2021-11-10.json', 'r') as f:
    data = json.load(f)

['admin', 'bda', 'config', 'local']


In [4]:
# Insert the documents into the collection and count them
result = tinder.insert_many(data)
tinder.count_documents({})

1209

In [5]:
# Check that the data was loaded correctly. Only the second-to-last key is displayed because the full document is extremely long
doc = tinder.find_one()
doc[list(doc.keys())[-2]]

{'birthDate': '1976-01-01T00:00:00.000Z',
 'ageFilterMin': 21,
 'ageFilterMax': 35,
 'cityName': 'Trondheim',
 'country': 'Norway',
 'createDate': '2016-01-01T09:30:07.551Z',
 'education': 'Has high school and/or college education',
 'gender': 'M',
 'interestedIn': 'F',
 'genderFilter': 'F',
 'instagram': False,
 'spotify': False,
 'jobs': [],
 'educationLevel': 'Has high school and/or college education',
 'schools': []}

# Data description

This section **describes the data. No description or explanation was provided with the dataset, so each field has to be analysed and understood before any later analysis can extract valuable information**. The data is therefore examined field by field, and the meaning of each field is inferred from the experiments carried out below and from actual use of the application.

In [6]:
# Get list of keys
tinder.find_one().keys()

dict_keys(['_id', '__v', 'appOpens', 'conversations', 'conversationsMeta', 'matches', 'messages', 'messagesReceived', 'messagesSent', 'swipeLikes', 'swipePasses', 'swipes', 'user', 'userId'])

**Explore all keys one by one**

In [7]:
# Get an idea of how id's are
cursor = tinder.find_one({}, {'_id': 1})
print(cursor)

# Get all possible values of id's
cursor = tinder.distinct('_id')
print(f"Number of different id's: {len(cursor)}")

{'_id': '00b74e27ad1cbb2ded8e907fcc49eaaf'}
Number of different id's: 1209


**From this output, '_id' is a unique, anonymous identifier for each document in the dataset.** Moreover, '__v' appears to be a version key that tracks the internal revision of the document, so it is not relevant to the current analysis.

In [8]:
# Get an idea of how appOpens is. Only some entries are displayed because of the extremely long output
cursor = tinder.find_one({}, {'appOpens': 1})
dict(list(cursor['appOpens'].items())[0:10])

{'2016-01-02': 26,
 '2016-01-13': 10,
 '2016-01-15': 28,
 '2016-01-17': 18,
 '2016-01-19': 15,
 '2016-01-20': 2,
 '2016-01-22': 16,
 '2016-01-30': 17,
 '2016-01-31': 17,
 '2016-02-01': 7}

From this sample, **'appOpens' records the number of times the user opened the app on each date**. The information is stored in a dictionary where the key is the date and the value is the number of opens.

In [9]:
# Get an idea of how conversations are
cursor = tinder.find_one({}, {'conversations': 1})
cursor['conversations'][0:2]

[{'match_id': 'Match 739',
  'messages': [{'to': 738,
    'from': 'You',
    'sent_date': 'Sun, 04 Aug 2019 12:50:22 GMT'},
   {'to': 738, 'from': 'You', 'sent_date': 'Fri, 09 Aug 2019 19:39:31 GMT'},
   {'to': 738, 'from': 'You', 'sent_date': 'Sun, 11 Aug 2019 12:14:55 GMT'}]},
 {'match_id': 'Match 738',
  'messages': [{'to': 737,
    'from': 'You',
    'sent_date': 'Sat, 03 Aug 2019 23:35:18 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 12:01:16 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 14:47:34 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 15:53:06 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 21:02:01 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Mon, 05 Aug 2019 06:15:28 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Mon, 05 Aug 2019 06:42:17 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Mon, 05 Aug 2019 06:42:53 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Mon, 05 Aug 2

In [10]:
# Get all possible values for the sender of the messages
cursor = tinder.distinct('conversations.messages.from')
cursor

['You']

From these outputs, **'conversations' holds the messages sent by the user in question; the only sender value in the whole collection is 'You', so received messages are not included. The information is stored in a list of dictionaries where every dictionary stores a conversation with a match.**

In [11]:
# Get an idea of how conversationsMeta is
cursor = tinder.find_one({}, {'conversationsMeta': 1})
cursor['conversationsMeta']

{'nrOfConversations': 739,
 'longestConversation': 133,
 'longestConversationInDays': 683.5574421296296,
 'averageConversationLength': 8.56021650879567,
 'averageConversationLengthInDays': 10.236619931839824,
 'medianConversationLength': 3,
 'medianConversationLengthInDays': 0.08113425925925925,
 'nrOfOneMessageConversations': 226,
 'percentOfOneMessageConversations': 30.581867388362653,
 'nrOfGhostingsAfterInitialMessage': 66}

**Since the meaning of some fields is not clear and no instructions or explanations were given with the data, the first task is to understand the data at hand. Hence, let's check whether the guessed meaning of each field is correct by recomputing the values from the conversations themselves.**

In [53]:
import datetime

# Get the conversations of one sample document and recompute its statistics
cursor = tinder.find_one({}, {'conversations': 1})
conversations = cursor['conversations']

# Number of conversations
nrOfConversations = len(conversations)

# Longest conversation (in number of messages sent by the user)
lens = [len(conversation['messages']) for conversation in conversations]
longestConversation = max(lens)

# Longest conversation in days (time between the first and the last sent message)
date_format = '%a, %d %b %Y %H:%M:%S GMT'
differences = []
for conversation in conversations:
    try:
        d1 = datetime.datetime.strptime(conversation['messages'][0]['sent_date'], date_format)
        d2 = datetime.datetime.strptime(conversation['messages'][-1]['sent_date'], date_format)
    except (IndexError, KeyError, ValueError):
        # Skip conversations without messages or with unparsable dates
        continue
    differences.append((d2 - d1).total_seconds() / 60 / 60 / 24)
longestConversationInDays = max(differences)

# Number of ghostings after the initial message: conversations in which the user sent no message
nrOfGhostingsAfterInitialMessage = sum(1 for n in lens if n == 0)

print(f"nrOfConversations: {nrOfConversations}\nlongestConversation: {longestConversation}\nlongestConversationInDays: {longestConversationInDays}\nnrOfGhostingsAfterInitialMessage: {nrOfGhostingsAfterInitialMessage}")

nrOfConversations: 739
longestConversation: 133
longestConversationInDays: 683.5574421296296
nrOfGhostingsAfterInitialMessage: 66


From what can be extracted, **'conversationsMeta' refers to the metadata of the messages sent by the user considered. The information is stored in a dictionary where the following data is found:**

- **nrOfConversations**: Total number of conversations held
- **longestConversation**: Length of the longest conversation
- **longestConversationInDays**: Length of the longest conversation, measured as the days elapsed between the first and last messages.
- **averageConversationLength**: Average length of the conversations held
- **averageConversationLengthInDays**: Average length of the conversations, measured as the days elapsed between the first and last messages.
- **medianConversationLength**: Median length of the conversations held
- **medianConversationLengthInDays**: Median length of the conversations, measured as the days elapsed between the first and last messages.
- **nrOfOneMessageConversations**: Total number of conversations consisting of just one message
- **percentOfOneMessageConversations**: Percentage of one-message conversations, computed as *100 · nrOfOneMessageConversations / nrOfConversations* (here 100 · 226 / 739 ≈ 30.58).
- **nrOfGhostingsAfterInitialMessage**: Total number of conversations in which the first message was received (the match started the conversation) and the user never replied.
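
The check above only recomputes four of the ten fields. The remaining length-based fields could be recomputed along the same lines. The following is a minimal sketch, assuming that conversation length is measured in messages sent by the user (as the `longestConversation` check suggests); the helper name `conversation_stats` and the toy input are made up for illustration and are not part of the dataset.

```python
from statistics import mean, median


def conversation_stats(conversations):
    """Recompute length-based statistics from a user's 'conversations' list.

    Assumption: the length of a conversation is the number of messages
    stored under its 'messages' key (messages sent by the user).
    """
    lengths = [len(conversation["messages"]) for conversation in conversations]
    if not lengths:
        # A user without conversations: report zeros, as seen in the data
        return {
            "nrOfConversations": 0,
            "longestConversation": 0,
            "averageConversationLength": 0,
            "medianConversationLength": 0,
            "nrOfOneMessageConversations": 0,
            "percentOfOneMessageConversations": 0,
        }
    one_message = sum(1 for n in lengths if n == 1)
    return {
        "nrOfConversations": len(lengths),
        "longestConversation": max(lengths),
        "averageConversationLength": mean(lengths),
        "medianConversationLength": median(lengths),
        "nrOfOneMessageConversations": one_message,
        "percentOfOneMessageConversations": 100 * one_message / len(lengths),
    }


# Toy input (hypothetical, not taken from the dataset): lengths 1, 3, 0 and 1
toy_conversations = [
    {"match_id": "Match 1", "messages": [{"from": "You"}]},
    {"match_id": "Match 2", "messages": [{"from": "You"}] * 3},
    {"match_id": "Match 3", "messages": []},
    {"match_id": "Match 4", "messages": [{"from": "You"}]},
]
stats = conversation_stats(toy_conversations)
print(stats)
```

Passing `cursor['conversations']` from the cell above instead of the toy list would allow a comparison against the stored `conversationsMeta` values; floating-point fields are best compared with a tolerance.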

In [54]:
# Get an idea of how matches is
cursor = tinder.find_one({}, {'matches': 1})
dict(list(cursor['matches'].items())[0:10])

{'2016-01-02': 10,
 '2016-01-13': 5,
 '2016-01-15': 9,
 '2016-01-17': 8,
 '2016-01-19': 6,
 '2016-01-20': 13,
 '2016-01-22': 4,
 '2016-01-30': 14,
 '2016-01-31': 2,
 '2016-02-01': 12}

From this sample, **'matches' records the total number of matches the user gets on each date**. The information is stored in a dictionary where the key is the date and the value is the number of matches.
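
Because the keys are ISO dates, a date-keyed dictionary like this one can be aggregated by day of the week, which is one of the questions listed at the top. The following is a minimal sketch; the helper name `counts_by_weekday` is hypothetical, and the sample dictionary reuses a few of the entries printed above.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def counts_by_weekday(daily_counts):
    """Sum a {'YYYY-MM-DD': count} dictionary by day of the week."""
    totals = Counter()
    for day, count in daily_counts.items():
        # date.weekday() returns 0 for Monday ... 6 for Sunday
        totals[WEEKDAYS[date.fromisoformat(day).weekday()]] += count
    return totals


# A few entries copied from the 'matches' sample shown above
sample_matches = {
    "2016-01-02": 10,
    "2016-01-13": 5,
    "2016-01-15": 9,
    "2016-01-17": 8,
    "2016-01-30": 14,
}
weekday_totals = counts_by_weekday(sample_matches)
print(weekday_totals)
```

The same helper would work for 'appOpens', since it shares the date-keyed layout.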

In [55]:
# Get an idea of how messages are
cursor = tinder.find_one({}, {'messages': 1})

In [56]:
dict(list(cursor['messages']['sent'].items())[0:10])

{'2016-01-02': 11,
 '2016-01-13': 2,
 '2016-01-15': 1,
 '2016-01-17': 5,
 '2016-01-19': 8,
 '2016-01-20': 10,
 '2016-01-22': 2,
 '2016-01-30': 12,
 '2016-01-31': 6,
 '2016-02-01': 5}

In [57]:
dict(list(cursor['messages']['received'].items())[0:10])

{'2016-01-02': 12,
 '2016-01-13': 3,
 '2016-01-15': 13,
 '2016-01-17': 0,
 '2016-01-19': 11,
 '2016-01-20': 6,
 '2016-01-22': 7,
 '2016-01-30': 2,
 '2016-01-31': 0,
 '2016-02-01': 7}

From these samples, **'messages' records the total number of messages the user sends or receives on each date**. The information is stored in a dictionary where the keys are 'sent' and 'received'. Each of these keys in turn refers to a dictionary where the key is the date and the value is the number of messages.

In [58]:
# Get an idea of how messagesReceived and messagesSent are
cursor = tinder.find_one({}, {'messagesReceived': 1, 'messagesSent': 1, 'messages': 1, })

# Check if messagesReceived and messages contain the same information
equal_received = True
for key in cursor['messagesReceived'].keys():
    if cursor['messages']['received'][key] != cursor['messagesReceived'][key]:
        equal_received = False

# Check if messagesSent and messages contain the same information
equal_sent = True
for key in cursor['messagesSent'].keys():
    if cursor['messages']['sent'][key] != cursor['messagesSent'][key]:
        equal_sent = False      
        
print(f"messages and messagesReceived contained the same information: {equal_received}")
print(f"messages and messagesSent contained the same information: {equal_sent}")

messages and messagesReceived contained the same information: True
messages and messagesSent contained the same information: True


**As can be seen, 'messagesReceived' and 'messagesSent' contain the same information as 'messages' for the sampled document and thus 'messagesReceived' and 'messagesSent' can be deleted, since they are redundant.**
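
The loop above inspects a single document and only the dates present in 'messagesReceived' and 'messagesSent'; dates that exist only in 'messages' would go unnoticed. A stricter comparison could look at the union of dates. The following is a minimal sketch with a hypothetical helper `differing_dates` and a toy document in which one date was removed on purpose.

```python
_MISSING = object()  # sentinel for dates present in only one dictionary


def differing_dates(nested, flat):
    """Return the dates whose counts differ between two date-keyed dicts.

    Dates that appear in only one of the dictionaries are reported too.
    """
    all_dates = nested.keys() | flat.keys()
    return sorted(
        day for day in all_dates
        if nested.get(day, _MISSING) != flat.get(day, _MISSING)
    )


# Toy document (made up for illustration): 'messagesReceived' lacks one date
toy_doc = {
    "messages": {
        "sent": {"2016-01-02": 11, "2016-01-13": 2},
        "received": {"2016-01-02": 12, "2016-01-13": 3},
    },
    "messagesSent": {"2016-01-02": 11, "2016-01-13": 2},
    "messagesReceived": {"2016-01-02": 12},
}
sent_diff = differing_dates(toy_doc["messages"]["sent"], toy_doc["messagesSent"])
received_diff = differing_dates(toy_doc["messages"]["received"], toy_doc["messagesReceived"])
print(sent_diff)
print(received_diff)
```

An empty list means the two dictionaries are identical. Running the helper over every document of the collection, rather than one, would be needed before dropping the fields.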

In [59]:
# Get an idea of what swipes looks like
cursor = tinder.find_one({}, {'swipes': 1})

In [60]:
dict(list(cursor['swipes']['likes'].items())[0:10])

{'2016-01-02': 50,
 '2016-01-13': 70,
 '2016-01-15': 21,
 '2016-01-17': 5,
 '2016-01-19': 29,
 '2016-01-20': 92,
 '2016-01-22': 27,
 '2016-01-30': 77,
 '2016-01-31': 84,
 '2016-02-01': 76}

In [61]:
dict(list(cursor['swipes']['passes'].items())[0:10])

{'2016-01-02': 14,
 '2016-01-13': 93,
 '2016-01-15': 75,
 '2016-01-17': 96,
 '2016-01-19': 71,
 '2016-01-20': 38,
 '2016-01-22': 76,
 '2016-01-30': 56,
 '2016-01-31': 21,
 '2016-02-01': 49}

From these samples, **'swipes' records the total number of swipes the user performs. The information is stored in a dictionary where the keys are 'likes' and 'passes', referring to the swipes for people the user likes and for people the user doesn't like, respectively.** Each of these keys in turn refers to a dictionary where the key is the date and the value is the number of swipes.

In [62]:
# Get an idea of what swipeLikes and swipePasses look like
cursor = tinder.find_one({}, {'swipeLikes': 1, 'swipePasses': 1, 'swipes': 1})

# Check if swipeLikes and swipes contain the same information
equal_likes = True
for key in cursor['swipeLikes'].keys():
    if cursor['swipes']['likes'][key] != cursor['swipeLikes'][key]:
        equal_likes = False

# Check if swipePasses and swipes contain the same information
equal_passes = True
for key in cursor['swipePasses'].keys():
    if cursor['swipes']['passes'][key] != cursor['swipePasses'][key]:
        equal_passes = False

print(f"swipeLikes and swipes contained the same information: {equal_likes}")
print(f"swipePasses and swipes contained the same information: {equal_passes}")

swipeLikes and swipes contained the same information: True
swipePasses and swipes contained the same information: True


**As can be seen, 'swipeLikes' and 'swipePasses' contain the same information as 'swipes' for the sampled document and thus 'swipeLikes' and 'swipePasses' can be deleted, since they are redundant.**

In [63]:
# Get an idea of how user is
cursor = tinder.find_one({'user.jobs': {'$ne': []}}, {'user': 1})
cursor['user']

{'birthDate': '1996-11-10T00:00:00.000Z',
 'ageFilterMin': 18,
 'ageFilterMax': 27,
 'createDate': '2017-11-17T23:30:37.231Z',
 'education': 'Has no high school or college education',
 'gender': 'M',
 'interestedIn': 'F',
 'genderFilter': 'F',
 'instagram': True,
 'spotify': False,
 'jobs': [{'companyDisplayed': False,
   'titleDisplayed': True,
   'title': 'Research Assistant'}],
 'educationLevel': 'Has no high school or college education',
 'schools': [{'displayed': True, 'name': 'Humboldt-Universität zu Berlin'}]}

In [88]:
# Get different values for education
print(tinder.distinct('user.education'))

# Get different values for education level
print(f"{tinder.distinct('user.educationLevel')}\n")

# Check if user.education and user.educationLevel contain the same information
equal_education = True
cursor = tinder.find({}, {'user': 1})
for user in cursor:
    if user['user']['education'] != user['user']['educationLevel']:
        equal_education = False
print(f"user.education and user.educationLevel contained the same information: {equal_education}")

['Has high school and/or college education', 'Has no high school or college education']
['Has high school and/or college education', 'Has no high school or college education']

user.education and user.educationLevel contained the same information: True


As can be seen, 'user.education' and 'user.educationLevel' contain the same information and thus 'user.educationLevel' can be deleted, since it is redundant.

In [70]:
# Get different values for gender
tinder.distinct('user.gender')

['', 'F', 'M']

In [69]:
# Get different values for gender filter
tinder.distinct('user.genderFilter')

['', 'F', 'M', 'M and F']

From what can be extracted,  **'user' refers to the personal data of the user considered. The information is stored in a dictionary** where the following data is found:

- **birthDate**: Date of birth of the user, as an ISO 8601 timestamp.
- **ageFilterMin**: Minimum age parameter for profiles displayed to the user.
- **ageFilterMax**: Maximum age parameter for profiles displayed to the user.
- **cityName**: City of the user. Not present in every profile.
- **country**: Country of the user. Not present in every profile.
- **createDate**: Profile creation date
- **education**: Whether or not the profile has high school or college education.
- **gender**: M or F. Some profiles store an empty string.
- **interestedIn**: Gender the profile is interested in. M, F or M and F.
- **genderFilter**: Gender parameter for profiles displayed to the user. M, F or M and F; some profiles store an empty string.
- **instagram**: Whether the profile links to an Instagram profile or not
- **spotify**: Whether the profile links to a Spotify profile or not
- **jobs**: List of dictionaries with job information, each containing: companyDisplayed (whether the company is displayed or not), titleDisplayed (whether the job title is displayed or not) and title.
- **educationLevel**: Whether or not the profile has high school or college education. Same values as education.
- **schools**: List of dictionaries with school information, each containing: displayed (whether the school name is displayed or not) and name.
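
Since 'birthDate' and 'createDate' are stored as strings, deriving an age requires parsing them first. The following is a minimal sketch, assuming every timestamp follows the format seen in the samples above; the helper name `age_at` is hypothetical.

```python
from datetime import datetime

# Format of the timestamps seen in the samples, e.g. '1976-01-01T00:00:00.000Z'
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def age_at(birth_date, reference_date):
    """Age in whole years at reference_date; both arguments are timestamp strings."""
    born = datetime.strptime(birth_date, ISO_FORMAT)
    ref = datetime.strptime(reference_date, ISO_FORMAT)
    years = ref.year - born.year
    # Subtract one year if the birthday has not yet occurred in the reference year
    if (ref.month, ref.day) < (born.month, born.day):
        years -= 1
    return years


# birthDate and createDate of the two sample profiles displayed earlier
age_first = age_at("1976-01-01T00:00:00.000Z", "2016-01-01T09:30:07.551Z")
age_second = age_at("1996-11-10T00:00:00.000Z", "2017-11-17T23:30:37.231Z")
print(age_first, age_second)
```

Ages computed this way could also help flag implausible birth dates during preprocessing.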

In [64]:
# Check if '_id' is equal to 'userId'
tinder.distinct('_id') == tinder.distinct('userId')

True

**'_id' and 'userId' contain the same set of values, which suggests that they store the same information. Thus, 'userId' can be deleted, since it is redundant.**
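
Comparing the two lists of distinct values shows that both fields take the same values overall, but not that they agree within each document. A per-document check could look like the following minimal sketch; the helper name `mismatched_ids` and the toy documents are made up for illustration.

```python
def mismatched_ids(documents):
    """Return the '_id' of every document whose 'userId' differs from its '_id'."""
    return [doc["_id"] for doc in documents if doc.get("userId") != doc["_id"]]


# Toy documents (hypothetical): the second pair has its userId values swapped,
# so the sets of distinct values are equal although two documents disagree
toy_docs = [
    {"_id": "a1", "userId": "a1"},
    {"_id": "b2", "userId": "c3"},
    {"_id": "c3", "userId": "b2"},
]
mismatches = mismatched_ids(toy_docs)
print(mismatches)
```

With pymongo, the result of `tinder.find({}, {'userId': 1})` could be passed to the helper, since '_id' is returned by default; an empty list would confirm the redundancy.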

In [71]:
# Get subkeys for interesting fields
cursor = tinder.find_one()

keys = []
for key in cursor.keys():
    
    if isinstance(cursor[key], dict) and key in ['conversationsMeta', 'swipes', 'user']:
        for nested_key in cursor[key].keys():
            keys.append(key + '.' + nested_key)
    else:
        keys.append(key)
        
print(keys)

['_id', '__v', 'appOpens', 'conversations', 'conversationsMeta.nrOfConversations', 'conversationsMeta.longestConversation', 'conversationsMeta.longestConversationInDays', 'conversationsMeta.averageConversationLength', 'conversationsMeta.averageConversationLengthInDays', 'conversationsMeta.medianConversationLength', 'conversationsMeta.medianConversationLengthInDays', 'conversationsMeta.nrOfOneMessageConversations', 'conversationsMeta.percentOfOneMessageConversations', 'conversationsMeta.nrOfGhostingsAfterInitialMessage', 'matches', 'messages', 'messagesReceived', 'messagesSent', 'swipeLikes', 'swipePasses', 'swipes.likes', 'swipes.passes', 'user.birthDate', 'user.ageFilterMin', 'user.ageFilterMax', 'user.cityName', 'user.country', 'user.createDate', 'user.education', 'user.gender', 'user.interestedIn', 'user.genderFilter', 'user.instagram', 'user.spotify', 'user.jobs', 'user.educationLevel', 'user.schools', 'userId']


# Data exploration

In [99]:
cursor = tinder.find({"user.instagram": True}, {'user': 1, 'conversationsMeta': 1, 'matches': 1}).limit(1)

## Missing values

In [72]:
# Values that may encode a missing value, paired with the label to print
placeholders = [('', "''"), (None, 'None'), ('none', 'none'), ('null', 'null'),
                ('Null', 'Null'), ('void', 'void'), ('Void', 'Void'), ('*', '*')]

for key in keys:
    try:
        distinct_values = tinder.distinct(key)
    except Exception:
        # Skip keys for which distinct() fails
        continue
    for value, label in placeholders:
        if value in distinct_values:
            print(f"{label} in {key}")

'' in user.cityName
* in user.cityName
'' in user.country
'' in user.gender
'' in user.interestedIn
'' in user.genderFilter


In [89]:
# Get all possible values of the gender filter
cursor = tinder.distinct('user.genderFilter')
list(cursor)

['', 'F', 'M', 'M and F']

In [9]:
# Get all possible values of the education level
cursor = tinder.distinct('user.educationLevel')
list(cursor)

['Has high school and/or college education',
 'Has no high school or college education']

In [13]:
cursor = tinder.find({"user.cityName": ''},
                     {'user': 1, 'conversationsMeta': 1, 'matches': 1})
list(cursor)

[{'_id': '7060673d6a49675f985a98f463bb0350',
  'conversationsMeta': {'nrOfConversations': 0,
   'longestConversation': 0,
   'longestConversationInDays': 0,
   'averageConversationLength': 0,
   'averageConversationLengthInDays': 0,
   'medianConversationLength': 0,
   'medianConversationLengthInDays': 0,
   'nrOfOneMessageConversations': 0,
   'percentOfOneMessageConversations': 0,
   'nrOfGhostingsAfterInitialMessage': 0},
  'matches': {'2021-02-19': 0,
   '2021-02-20': 0,
   '2021-02-21': 0,
   '2021-02-23': 0,
   '2021-02-24': 0,
   '2021-03-24': 0,
   '2021-04-12': 3,
   '2021-04-13': 0,
   '2021-04-14': 0},
  'user': {'birthDate': '1900-11-11T00:00:00.000Z',
   'ageFilterMin': 29,
   'ageFilterMax': 1000,
   'cityName': '',
   'country': '',
   'createDate': '2021-02-19T09:23:23.940Z',
   'education': 'Has no high school or college education',
   'gender': 'M',
   'interestedIn': 'F',
   'genderFilter': 'F',
   'instagram': False,
   'spotify': False,
   'jobs': [{'companyDisplay

# Data analysis

# Conclusions

# Close connection with database

In [2]:
# Drop the collection
tinder.drop()

# Close connection
client.close()