# Tinder Analysis

**TODO**:

    - Missing values 
    
    - Data preprocessing?
    
    - Statistics: https://tinderinsights.com/
    
    
**Questions:**

    - How many people have finished college? Among those who have, is there any sex bias?
    
    - What are the minimum, mean and maximum percentages of one-message conversations for each sex?
    
    - Which sex links Instagram more often?
    
    - Most used emojis by sex
    
    - Number of matches by sex
    
    - Number of matches by day of the week

# Problem description

Describe the problem, including any required background, and explain why you believe it is important / interesting.


No description of the data is given

# Data collection

Where was the data obtained?

# MongoDB data load

Relational vs. NoSQL: which one makes more sense?

In [3]:
import json
import pymongo
from pymongo import MongoClient

client = MongoClient('localhost', 27017, username='mongoadmin', password='pass1234')

# Get the list of DBs already defined
print(client.list_database_names())

# Create a new DB - bda
database = client['bda']

# Create a collection
tinder = database.tinder

# Read the data from the JSON file
with open('tinder_dataset/profiles_2021-11-10.json', 'r') as f:
    data = json.load(f)

['admin', 'bda', 'config', 'local']


In [4]:
# Insert the data into the collection and count the stored documents
result = tinder.insert_many(data)
tinder.count_documents({})

1209

In [5]:
# Check that the data was loaded correctly. Only the 'user' key is displayed because the full output is extremely long
tinder.find_one({}, {'user': 1})['user']

{'birthDate': '1976-01-01T00:00:00.000Z',
 'ageFilterMin': 21,
 'ageFilterMax': 35,
 'cityName': 'Trondheim',
 'country': 'Norway',
 'createDate': '2016-01-01T09:30:07.551Z',
 'education': 'Has high school and/or college education',
 'gender': 'M',
 'interestedIn': 'F',
 'genderFilter': 'F',
 'instagram': False,
 'spotify': False,
 'jobs': [],
 'educationLevel': 'Has high school and/or college education',
 'schools': []}

# Data description

Get to know the data

In [6]:
# Get list of keys
tinder.find_one().keys()

dict_keys(['_id', '__v', 'appOpens', 'conversations', 'conversationsMeta', 'matches', 'messages', 'messagesReceived', 'messagesSent', 'swipeLikes', 'swipePasses', 'swipes', 'user', 'userId'])

**Explore all keys one by one**

In [7]:
# Get an idea of what the ids look like
cursor = tinder.find_one({}, {'_id': 1})
print(cursor)

# Get all distinct id values
cursor = tinder.distinct('_id')
print(f"Number of different id's: {len(cursor)}")

{'_id': '00b74e27ad1cbb2ded8e907fcc49eaaf'}
Number of different id's: 1209


**From what can be extracted, '_id' is a unique, anonymous identifier for each instance of the dataset.** Moreover, '__v' is a version key that stores the internal revision of the document, so it is not relevant to the current analysis.

In [8]:
# Get an idea of what appOpens looks like. Only some entries are displayed because the full output is extremely long
cursor = tinder.find_one({}, {'appOpens': 1})
dict(list(cursor['appOpens'].items())[0:10])

{'2016-01-02': 26,
 '2016-01-13': 10,
 '2016-01-15': 28,
 '2016-01-17': 18,
 '2016-01-19': 15,
 '2016-01-20': 2,
 '2016-01-22': 16,
 '2016-01-30': 17,
 '2016-01-31': 17,
 '2016-02-01': 7}

From what can be extracted, **'appOpens' refers to the number of times a user opened the app on each date**. The information is stored in a dictionary whose keys are dates and whose values are the number of opens on that date.
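Since 'appOpens' maps dates to counts, it can be aggregated by day of the week, which is also the kind of grouping needed for the "matches by day of the week" question. The following is a minimal sketch that works on a plain dictionary rather than on the database; the helper name `counts_by_weekday` is hypothetical, and the sample reuses the first three entries of the output above.

```python
from collections import defaultdict
from datetime import date

# Fixed English names, so the result does not depend on the system locale
WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

def counts_by_weekday(daily_counts):
    """Sum a {'YYYY-MM-DD': count} dictionary by day of the week."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        # date.weekday() returns 0 for Monday ... 6 for Sunday
        weekday = WEEKDAYS[date.fromisoformat(day).weekday()]
        totals[weekday] += count
    return dict(totals)

# First three entries of the 'appOpens' sample shown above
sample = {'2016-01-02': 26, '2016-01-13': 10, '2016-01-15': 28}
weekday_totals = counts_by_weekday(sample)
print(weekday_totals)
```

The same function could be applied to the 'matches' field, which appears to follow the same date-to-count layout.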

In [9]:
# Get an idea of what conversations look like
cursor = tinder.find_one({}, {'conversations': 1})
cursor['conversations'][0:2]

[{'match_id': 'Match 739',
  'messages': [{'to': 738,
    'from': 'You',
    'sent_date': 'Sun, 04 Aug 2019 12:50:22 GMT'},
   {'to': 738, 'from': 'You', 'sent_date': 'Fri, 09 Aug 2019 19:39:31 GMT'},
   {'to': 738, 'from': 'You', 'sent_date': 'Sun, 11 Aug 2019 12:14:55 GMT'}]},
 {'match_id': 'Match 738',
  'messages': [{'to': 737,
    'from': 'You',
    'sent_date': 'Sat, 03 Aug 2019 23:35:18 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 12:01:16 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 14:47:34 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 15:53:06 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 21:02:01 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Mon, 05 Aug 2019 06:15:28 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Mon, 05 Aug 2019 06:42:17 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Mon, 05 Aug 2019 06:42:53 GMT'},
   {'to': 737, 'from': 'You', 'sent_date': 'Mon, 05 Aug 2

In [25]:
# Get all possible values for the sender of the messages
cursor = tinder.distinct('conversations.messages.from')
cursor

['You']

From what can be extracted, **'conversations' holds the messages sent by the user in question (the only sender value is 'You'). The information is stored in a list of dictionaries, where each dictionary stores the conversation with one match.**
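To work with this nested structure it can help to flatten it into one row per message. The sketch below assumes the 'sent_date' format seen in the output above; the helper name `flatten_messages` and the empty 'Match 0' conversation are hypothetical and only serve as illustration.

```python
from datetime import datetime

# Format used by 'sent_date' above, e.g. 'Sun, 04 Aug 2019 12:50:22 GMT'
DATE_FORMAT = '%a, %d %b %Y %H:%M:%S GMT'

def flatten_messages(conversations):
    """Return one (match_id, sent_datetime) tuple per message sent by the user."""
    rows = []
    for conversation in conversations:
        for message in conversation['messages']:
            sent = datetime.strptime(message['sent_date'], DATE_FORMAT)
            rows.append((conversation['match_id'], sent))
    return rows

# Shortened sample; the empty conversation is a hypothetical addition
sample_conversations = [
    {'match_id': 'Match 739', 'messages': [
        {'to': 738, 'from': 'You', 'sent_date': 'Sun, 04 Aug 2019 12:50:22 GMT'},
        {'to': 738, 'from': 'You', 'sent_date': 'Fri, 09 Aug 2019 19:39:31 GMT'},
    ]},
    {'match_id': 'Match 0', 'messages': []},
]
rows = flatten_messages(sample_conversations)
print(rows)
```

Conversations without sent messages simply contribute no rows, so they need to be counted separately if they matter for the analysis.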

In [11]:
# Get an idea of what conversationsMeta looks like
cursor = tinder.find_one({}, {'conversationsMeta': 1})
cursor['conversationsMeta']

{'nrOfConversations': 739,
 'longestConversation': 133,
 'longestConversationInDays': 683.5574421296296,
 'averageConversationLength': 8.56021650879567,
 'averageConversationLengthInDays': 10.236619931839824,
 'medianConversationLength': 3,
 'medianConversationLengthInDays': 0.08113425925925925,
 'nrOfOneMessageConversations': 226,
 'percentOfOneMessageConversations': 30.581867388362653,
 'nrOfGhostingsAfterInitialMessage': 66}

**Since the meaning of some fields is unclear and the data came with no documentation, the first task is to understand the data at hand. Hence, let's check whether the guessed meaning of each field is correct by recomputing the statistics from the conversations themselves.**

In [46]:
# Get the conversations of a sample user and recompute their statistics
import datetime

cursor = tinder.find_one({}, {'conversations': 1})
conversations = cursor['conversations']

# Number of conversations
nrOfConversations = len(conversations)

# Longest conversation (in number of messages)
lens = [len(conversation['messages']) for conversation in conversations]
longestConversation = max(lens)

# Longest conversation in days (time between the first and the last message)
date_format = '%a, %d %b %Y %H:%M:%S GMT'
differences = []
for conversation in conversations:
    messages = conversation['messages']
    try:
        d1 = datetime.datetime.strptime(messages[0]['sent_date'], date_format)
        d2 = datetime.datetime.strptime(messages[-1]['sent_date'], date_format)
    except (IndexError, KeyError, ValueError):
        # Skip conversations without messages or with unparsable dates
        continue
    differences.append((d2 - d1).total_seconds() / 60 / 60 / 24)
longestConversationInDays = max(differences)

# Number of ghostings after initial message: conversations in which the user sent no message
nrOfGhostingsAfterInitialMessage = sum(1 for n in lens if n == 0)

print(f"nrOfConversations: {nrOfConversations}\nlongestConversation: {longestConversation}\nlongestConversationInDays: {longestConversationInDays}\nnrOfGhostingsAfterInitialMessage: {nrOfGhostingsAfterInitialMessage}")

nrOfConversations: 739
longestConversation: 133
longestConversationInDays: 683.5574421296296
nrOfGhostingsAfterInitialMessage: 66


From what can be extracted, **'conversationsMeta' holds summary statistics about the conversations of the user in question. The information is stored in a dictionary with the following fields:**

- **nrOfConversations**: Total number of conversations held.
- **longestConversation**: Number of messages in the longest conversation.
- **longestConversationInDays**: Duration of the longest conversation, measured in days between its first and last messages.
- **averageConversationLength**: Average number of messages per conversation.
- **averageConversationLengthInDays**: Average duration of the conversations, measured in days between the first and last messages.
- **medianConversationLength**: Median number of messages per conversation.
- **medianConversationLengthInDays**: Median duration of the conversations, measured in days between the first and last messages.
- **nrOfOneMessageConversations**: Total number of conversations consisting of just one message.
- **percentOfOneMessageConversations**: Percentage of one-message conversations, computed as *100 · nrOfOneMessageConversations / nrOfConversations* (226 / 739 ≈ 30.58 % in the sample above).
- **nrOfGhostingsAfterInitialMessage**: Total number of conversations in which the match sent the first message and the user never replied.
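The message-count fields in this list can be recomputed from a list of conversations in the same spirit as the check above. This is only a sketch under assumptions: the helper name `conversation_length_stats` is hypothetical, the sample data is made up, and it is not confirmed whether the dataset includes empty conversations in its averages, so the results may not match 'conversationsMeta' exactly.

```python
from statistics import mean, median

def conversation_length_stats(conversations):
    """Recompute the message-count statistics for one user's conversations."""
    lengths = [len(conversation['messages']) for conversation in conversations]
    if not lengths:
        # Mirror the all-zero metadata of users without conversations
        return {'nrOfConversations': 0,
                'averageConversationLength': 0,
                'medianConversationLength': 0,
                'nrOfOneMessageConversations': 0,
                'percentOfOneMessageConversations': 0}
    one_message = sum(1 for n in lengths if n == 1)
    return {'nrOfConversations': len(lengths),
            'averageConversationLength': mean(lengths),
            'medianConversationLength': median(lengths),
            'nrOfOneMessageConversations': one_message,
            'percentOfOneMessageConversations': 100 * one_message / len(lengths)}

# Hypothetical sample: four conversations with 3, 1, 0 and 1 sent messages
sample_conversations = [{'messages': [{}] * n} for n in (3, 1, 0, 1)]
stats = conversation_length_stats(sample_conversations)
print(stats)
```

Comparing these values with the stored metadata for a few users would show whether the guessed definitions hold.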

In [68]:
#cursor['conversations'] # LIST OF DICTS

In [23]:
cursor = tinder.find({"user.instagram": True},
                     {'user': 1, 'conversationsMeta': 1, 'matches': 1}).limit(1)
list(cursor)

[{'_id': '0eb998fdde77f9c123c07eace18a5cc1',
  'conversationsMeta': {'nrOfConversations': 809,
   'longestConversation': 444,
   'longestConversationInDays': 198.39097222222222,
   'averageConversationLength': 6.8244746600741655,
   'averageConversationLengthInDays': 1.924995893993499,
   'medianConversationLength': 3,
   'medianConversationLengthInDays': 0.09177083333333333,
   'nrOfOneMessageConversations': 296,
   'percentOfOneMessageConversations': 36.58838071693449,
   'nrOfGhostingsAfterInitialMessage': 13},
  'matches': {'2017-11-17': 0,
   '2017-11-18': 1,
   '2017-11-19': 1,
   '2017-11-20': 3,
   '2017-11-21': 2,
   '2017-11-22': 1,
   '2017-11-23': 1,
   '2017-11-24': 1,
   '2017-11-25': 2,
   '2017-11-26': 0,
   '2017-11-27': 1,
   '2017-11-28': 4,
   '2017-11-29': 3,
   '2017-11-30': 3,
   '2017-12-01': 3,
   '2017-12-02': 3,
   '2017-12-03': 0,
   '2017-12-04': 1,
   '2017-12-05': 2,
   '2017-12-06': 0,
   '2017-12-07': 2,
   '2017-12-08': 2,
   '2017-12-09': 4,
   '2017-

In [10]:
# Get subkeys for interesting fields
cursor = tinder.find_one()

keys = []
for key in cursor.keys():
    
    if isinstance(cursor[key], dict) and key in ['conversationsMeta', 'user']:
        for nested_key in cursor[key].keys():
            keys.append(key + '.' + nested_key)
    else:
        keys.append(key)
        
print(keys)

['_id', '__v', 'appOpens', 'conversations', 'conversationsMeta.nrOfConversations', 'conversationsMeta.longestConversation', 'conversationsMeta.longestConversationInDays', 'conversationsMeta.averageConversationLength', 'conversationsMeta.averageConversationLengthInDays', 'conversationsMeta.medianConversationLength', 'conversationsMeta.medianConversationLengthInDays', 'conversationsMeta.nrOfOneMessageConversations', 'conversationsMeta.percentOfOneMessageConversations', 'conversationsMeta.nrOfGhostingsAfterInitialMessage', 'matches', 'messages', 'messagesReceived', 'messagesSent', 'swipeLikes', 'swipePasses', 'swipes', 'user.birthDate', 'user.ageFilterMin', 'user.ageFilterMax', 'user.cityName', 'user.country', 'user.createDate', 'user.education', 'user.gender', 'user.interestedIn', 'user.genderFilter', 'user.instagram', 'user.spotify', 'user.jobs', 'user.educationLevel', 'user.schools', 'userId']


# Data exploration

Look at slides

## Missing values

In [86]:
# Look for placeholder values that may encode missing data
placeholders = ['', None, 'none', 'null', 'Null', 'void', 'Void', '*']
for key in keys:
    try:
        distinct_values = tinder.distinct(key)
    except Exception:
        # distinct() may fail for some fields; skip them
        continue
    for placeholder in placeholders:
        if placeholder in distinct_values:
            label = "''" if placeholder == '' else placeholder
            print(f"{label} in {key}")

'' in user.cityName
* in user.cityName
'' in user.country
'' in user.gender
'' in user.interestedIn
'' in user.genderFilter
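
One possible way to handle these placeholders during preprocessing is to map them to `None`, so that missing values are represented consistently. The sketch below operates on a plain 'user' dictionary; the helper name `normalize_missing` is hypothetical, and treating '*' as missing is an assumption that should be checked against the affected documents first.

```python
# Values assumed to encode missing data (based on the check above)
MISSING_PLACEHOLDERS = {'', '*'}

# Fields in which placeholders were found
AFFECTED_FIELDS = ('cityName', 'country', 'gender',
                   'interestedIn', 'genderFilter')

def normalize_missing(user, fields=AFFECTED_FIELDS):
    """Return a copy of a 'user' dictionary with placeholders replaced by None."""
    cleaned = dict(user)
    for field in fields:
        value = cleaned.get(field)
        if isinstance(value, str) and value.strip() in MISSING_PLACEHOLDERS:
            cleaned[field] = None
    return cleaned

sample_user = {'cityName': '', 'country': '', 'gender': 'M',
               'interestedIn': 'F', 'genderFilter': 'F'}
cleaned_user = normalize_missing(sample_user)
print(cleaned_user)
```

The original dictionary is left untouched, so the raw values remain available for comparison.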


In [89]:
# Get all possible values of genderFilter
cursor = tinder.distinct('user.genderFilter')
list(cursor)

['', 'F', 'M', 'M and F']

In [9]:
# Get all possible values of educationLevel
cursor = tinder.distinct('user.educationLevel')
list(cursor)

['Has high school and/or college education',
 'Has no high school or college education']

In [13]:
cursor = tinder.find({"user.cityName": ''},
                     {'user': 1, 'conversationsMeta': 1, 'matches': 1})
list(cursor)

[{'_id': '7060673d6a49675f985a98f463bb0350',
  'conversationsMeta': {'nrOfConversations': 0,
   'longestConversation': 0,
   'longestConversationInDays': 0,
   'averageConversationLength': 0,
   'averageConversationLengthInDays': 0,
   'medianConversationLength': 0,
   'medianConversationLengthInDays': 0,
   'nrOfOneMessageConversations': 0,
   'percentOfOneMessageConversations': 0,
   'nrOfGhostingsAfterInitialMessage': 0},
  'matches': {'2021-02-19': 0,
   '2021-02-20': 0,
   '2021-02-21': 0,
   '2021-02-23': 0,
   '2021-02-24': 0,
   '2021-03-24': 0,
   '2021-04-12': 3,
   '2021-04-13': 0,
   '2021-04-14': 0},
  'user': {'birthDate': '1900-11-11T00:00:00.000Z',
   'ageFilterMin': 29,
   'ageFilterMax': 1000,
   'cityName': '',
   'country': '',
   'createDate': '2021-02-19T09:23:23.940Z',
   'education': 'Has no high school or college education',
   'gender': 'M',
   'interestedIn': 'F',
   'genderFilter': 'F',
   'instagram': False,
   'spotify': False,
   'jobs': [{'companyDisplay

# Data analysis

# Conclusions

# Close connection with database

In [2]:
# Drop the collection
tinder.drop()

# Close connection
client.close()