### The goal of this exercise is to find duplicates in this file.
### However, what counts as a "duplicate" can vary from person to person, so we'll generate a separate file for each approach we use to find duplicates.

### For example, we might consider a data point a duplicate if several relevant columns (features) of the dataframe have the same values as in another row.
### One such rule: if "name" and "id" coincide, then it's a duplicate.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('/home/local/Documents/dealroom-ds-task/data/data_scientist_duplicate_detection.csv')
df

Unnamed: 0,id,name,tagline,industry,industry_2,type,address,street,street_number,zip,country,city,sectors
0,1501006,QA Education,"QA Education Magazine, guide to purchasing ser...",Kids,,2,"United Kingdom, Chorley",,,PR7 2,United Kingdom,Chorley,
1,1370997,Seeking Millionaire,Millionaire dating website that matches those ...,Dating,,2,"United States, Las Vegas",,,,United States,Las Vegas,
2,1726114,LullaMe,Manufacturer of self-rocking baby mattress for...,Kids,Health,2,"Kuortanegatan 2, 00510 Helsingfors, Finland",Kuortanegatan,2,00510,Finland,Helsingfors,"femtech,automated process,outside tech,Automat..."
3,1475926,Book Box,Book Box,Kids,Wellness Beauty,2,"02000 Kiev, Ukraine",,,02000,Ukraine,Kiev,Human Resources
4,223088,Uncharted Play,,Kids,,2,"United States, New York",,,10007,United States,New York,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15891,868858,"Cunesoft, a Phlexglobal company",Cloud-based regulatory software solutions and ...,Enterprise Software,Health,2,"Luise-Ullrich-Straße 20, 80636 Munich, Germany",Luise-Ullrich-Straße,20,,Germany,Munich,"back office,moving services,Data Analytics,dat..."
15892,1464030,User Interface Design Gmbh,UX Design & UI Development | UID User Interfac...,Enterprise Software,,2,"Germany, Ludwigsburg",,,71638,Germany,Ludwigsburg,Innovation Radar
15893,6564,ConWeaver,Softwares for the automatic integration of cor...,Enterprise Software,,2,"Friedenspl. 12, 64283 Darmstadt, Germany",Friedensplatz,,,Germany,Darmstadt,"industrial analytics,industrial technology,Avi..."
15894,147345,StackShare,Developer Social Network,Enterprise Software,Media,2,"United States, Mountain View",,,,United States,Mountain View,"Open Source,chat,social network,Infrastructure..."


In [54]:
# Duplicate = Coincidence in "id" and "name".
dups_id_name_overlap = df.loc[df.duplicated(subset=['id', 'name']), :]

# Export duplicates to file.
dups_id_name_overlap.to_csv("~/Documents/dealroom-ds-task/output_csv/dups_id_name_overlap.csv")

In [53]:
# Duplicate = Coincidence in "tagline" and "name".
dups_tagline_name_overlap = df.loc[df.duplicated(subset=['tagline', 'name']), :]

# Export duplicates to file.
dups_tagline_name_overlap.to_csv("~/Documents/dealroom-ds-task/output_csv/dups_tagline_name_overlap.csv")

In [55]:
# Duplicate = Coincidence in "zip" and "name".
dups_zip_name_overlap = df.loc[df.duplicated(subset=['zip', 'name']), :]

# Export duplicates to file.
dups_zip_name_overlap.to_csv("~/Documents/dealroom-ds-task/output_csv/dups_zip_name_overlap.csv")

In [46]:
# Duplicate = Coincidence in "address" and "name".
dups_address_name_overlap = df.loc[df.duplicated(subset=['address', 'name']), :]

# Export duplicates to file.
dups_address_name_overlap.to_csv("~/Documents/dealroom-ds-task/output_csv/dups_address_name_overlap.csv")


### We considered different options to check for duplicates and we dumped the duplicate datapoints to different files in order to decide later which datapoints to drop, depending on which criteria we find the most fitting.
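
To compare the criteria before deciding which one to drop rows by, it can help to count how many rows each rule flags. The following is a minimal sketch on a hypothetical toy dataframe (the values and the `criteria` dictionary are made up for illustration; only the column names come from the real file).

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real CSV.
toy = pd.DataFrame({
    "id":      [1, 1, 2, 3],
    "name":    ["Acme", "Acme", "Acme", "Beta"],
    "tagline": ["Tools", "Tools", "Tools", None],
    "zip":     ["75001", "69001", "75001", "00100"],
})

# Each rule is a list of columns that must all coincide for a row to count as a duplicate.
criteria = {
    "id_name": ["id", "name"],
    "tagline_name": ["tagline", "name"],
    "zip_name": ["zip", "name"],
}

# duplicated() flags every repeat after the first occurrence; summing the mask counts them.
dup_counts = {
    label: int(toy.duplicated(subset=cols).sum())
    for label, cols in criteria.items()
}
print(dup_counts)
```

The same dictionary could drive the exports above, one file per rule, but that is left out here to keep the sketch free of file paths.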

### Now, we'll explore some more "manual" ways to find the number of duplicates in total in the dataset.

In [15]:
# Manual way to count fully duplicated rows: df.duplicated() compares every column, row by row.

temp_var = df.duplicated()
dup_num = 0

# temp_var is a boolean Series; each True marks a row that repeats an earlier one.
for dup in temp_var:
    if dup:
        dup_num += 1

print(dup_num)

type(temp_var)

1


pandas.core.series.Series

In [17]:
# Shorter alternative: sum the boolean mask to count fully duplicated rows (every column must match).
df.duplicated().sum()

1

In [None]:
# We'll check another column.



In [18]:
# We'll show the duplicated rows on the screen. With the default keep='first', the first occurrence
# of each row is NOT flagged; every later copy is. A row that appears three times shows up twice here.
df.loc[df.duplicated(), :]

Unnamed: 0,id,name,tagline,industry,industry_2,type,address,street,street_number,zip,country,city,sectors
655,867000,Swarovski Group,Homepage - Swarovski Group,,,4,"Switzerland, Männedorf, Alte Landstrasse, 411",Alte Landstrasse,411,8708,Switzerland,Männedorf,
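
As a small illustration of which rows `duplicated()` flags, the sketch below uses a hypothetical toy dataframe in which one row appears three times. It contrasts the default `keep="first"` with `keep=False`, which flags every member of a duplicated group.

```python
import pandas as pd

# Hypothetical toy data: the row (7, "A") appears three times.
toy = pd.DataFrame({"id": [7, 7, 7, 8], "name": ["A", "A", "A", "B"]})

# Default keep="first": the first occurrence is kept unflagged, later copies are flagged.
later_copies = toy[toy.duplicated()]

# keep=False: every row that belongs to a duplicated group is flagged, including the first.
all_copies = toy[toy.duplicated(keep=False)]

print(len(later_copies))
print(len(all_copies))
```

Using `keep=False` is handy for inspecting the original row next to its copies before deciding which one to drop.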


In [27]:

# We create the "duplicates variable" from the command we ran above.
duplicates = df.loc[df.duplicated(), :]

# We dump our duplicates to a file.
duplicates.to_csv("~/Documents/dealroom-ds-task/output_csv/duplicates.csv")

In [28]:
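# Rows whose "tagline" repeats an earlier one. Missing taglines (NaN) are treated as equal to each other,
# so most of the rows below are flagged only because their tagline is empty.
df.loc[df.tagline.duplicated(), :]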

Unnamed: 0,id,name,tagline,industry,industry_2,type,address,street,street_number,zip,country,city,sectors
25,2851294,BotBuilders.Tech,,Kids,,2,"Bengaluru, Karnataka 560060, India",,,560060,India,Bengaluru,
50,130636,iCreate,,Kids,,2,"United States, Austin",,,,United States,Austin,
65,1314213,The Johns Hopkins Bloomberg School of Public H...,"Research, education, and practice to find solu...",Kids,Education,8,"United States, Baltimore",,,,United States,Baltimore,
69,2026805,"Electus Global Education Co., Inc.",,Kids,,2,"Tampa, FL, United States",,,33602,United States,Tampa,
72,2944774,Photon,,Kids,Education,2,"Bialystok, Woj. Podlaskie, Poland",,,15-007,Poland,Bialystok,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15789,180438,3Seventy,,Enterprise Software,,2,"United States, Austin",,,,United States,Austin,
15814,2941985,Paperspace,,Enterprise Software,,2,"Brooklyn, NY, United States",,,11201,United States,Brooklyn,
15837,134996,PerfectSearch,,Enterprise Software,,2,"United States, Orem",,,,United States,Orem,
15856,151814,Solstice Mobile,,Enterprise Software,,2,"United States, Chicago",,,,United States,Chicago,
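
Many of the rows above have no tagline at all, because `duplicated()` treats missing values as equal to one another. A minimal sketch of one way to separate real repeats from missing values, on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: two missing taglines and one tagline that really repeats.
toy = pd.DataFrame({
    "name":    ["A", "B", "C", "D"],
    "tagline": [None, None, "Same", "Same"],
})

# Naive check: the second missing tagline is flagged as a duplicate of the first.
naive = toy[toy["tagline"].duplicated()]

# Stricter check: only flag rows whose tagline is present and repeats an earlier one.
real = toy[toy["tagline"].notna() & toy["tagline"].duplicated()]

print(naive["name"].tolist())
print(real["name"].tolist())
```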


In [34]:
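# We take a different approach to find duplicates.
# Here we can see which company names repeat an earlier name in the full dataframe.
# Note that dropna() removes every row with a missing value in ANY column, so only fully populated rows are shown.
df.dropna().loc[df.name.duplicated(), :]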

Unnamed: 0,id,name,tagline,industry,industry_2,type,address,street,street_number,zip,country,city,sectors
4807,123520,Planetarians,Alternative protein from upcycled seeds at the...,Food,Wellness Beauty,2,"630 Hansen Way, Palo Alto, CA 94304, USA",Hansen Way,630,94304,United States,Palo Alto,"grains,superfood,Nutrition Solution,Innovative..."
6498,1470717,ASG,Designs and manufactures resistive and superco...,Energy,Semiconductors,4,"Corso Ferdinando Maria Perrone, 73, 16152 Geno...",Corso Ferdinando Maria Perrone,73,16152,Italy,Genova,"component,Medical,electromagnetic,Medical Tech..."
6535,863052,Breeze Technologies,Breeze pushes the limits of environmental sens...,Energy,Real Estate,2,"Harburger Schloßstraße 6-12, 21079 Hamburg, Ge...",Harburger Schloßstraße,6-12,21079,Germany,Hamburg,"sustainable development goals,urban tech,Susta..."
7668,892048,Pixyl,Working to improve patient care by placing the...,Health,Enterprise Software,2,"5 Avenue du Grand Sablon, 38700 La Tronche, Fr...",Avenue du Grand Sablon,5,38700,France,La Tronche,"healthcare,neurology,patient care,stroke,Biote..."
8623,1449811,MeetinVR,Combines the flexibility of online meetings wi...,Fashion,Telecom,2,"6, Annexstræde, 2500 Copenhagen, Denmark",Annexstræde,6,2500,Denmark,Copenhagen,"platform,SaaS,Social"
9556,874024,Payments & Cards Network,"Payments & Cards Network - Jobs in Payments, F...",Fintech,Jobs Recruitment,2,"Netherlands, Amsterdam, Herengracht, 576",Herengracht,576,1017 CJ,Netherlands,Amsterdam,"financial service,credit card,network,Recruitm..."
9613,16240,Quadrem,Purchase-to-pay marketplace provider,Fintech,Enterprise Software,2,"Netherlands, Amsterdam, Kabelweg, 61",Kabelweg,61,1014 BA,Netherlands,Amsterdam,"Accounting,Procurement,StartupAmsterdam2020,St..."
11522,1452118,WritePath,"AI translation solution for User manual, Inves...",Education,Fintech,2,"Taiwan, Taipei City, Lane 559, Sec 4, Zhong Xi...","Lane 559, Sec 4, Zhong Xiao E. Rd.",1,110,Taiwan,Taipei City,translation
11853,980602,Flowtap,Flowtap is your partner for artificial intelli...,Marketing,Enterprise Software,2,"33, Schottenring, 1010 Vienna, Austria",Schottenring,33,1010,Austria,Vienna,"CRM,Analytics,SaaS"
13068,867177,Softeam,Consulting and technology services in France,Home Living,Media,15,"France, Paris, Avenue des Champs-Élysées, 70",Avenue des Champs-Élysées,70,75008,France,Paris,"appliances,outside tech,agency,consulting serv..."


In [14]:
# We can think of this as a table in a relational database: the same ID means the same element, thus a duplicate.

# To find duplicates, we count how many times each ID occurs.
# For that purpose, we use Counter from collections.
from collections import Counter

id_counts = Counter(df['id'])

# Loop through every key-value pair in the Counter. If the count is greater than one, the ID has duplicates, so print it.
for key, value in id_counts.items():
    if value > 1:
        print("({}, {})".format(key, value), end="\n" * 2)
        print("The key (or ID) {} has {} instances in this CSV file".format(key, value))


# Then we can check how many instances a given ID has by looking it up (IDs are integers, so use an int key).
# id_counts[867000]

(867000, 2)

The key (or ID) 867000 has 2 instances in this CSV file


As you can see, only one ID is duplicated.
It has two occurrences in total in the .csv file.
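
The same count can be obtained without leaving pandas. The sketch below uses `Series.value_counts()` on a hypothetical toy column (the values are made up, apart from reusing the ID seen above) and keeps only the IDs that occur more than once.

```python
import pandas as pd

# Hypothetical toy column; only 867000 mirrors the ID reported above.
toy = pd.DataFrame({"id": [867000, 42, 867000, 7]})

# value_counts() returns a Series mapping each ID to its number of occurrences.
counts = toy["id"].value_counts()

# Keep only the IDs that appear more than once.
repeated = counts[counts > 1]
print(repeated.to_dict())
```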

In [6]:
for value in id_counts.values():
    if value > 1:
        print(value)

2


In [7]:
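# The IDs are integers, so the lookup key must be an int; a string key such as '867000' is absent and yields 0.
id_counts[867000]

2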
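
Lookups in a `Counter` depend on the type of the key: the integer `867000` and the string `'867000'` are different keys, and a `Counter` returns 0 for any key it has never counted instead of raising an error. A minimal standalone sketch with made-up values:

```python
from collections import Counter

# Hypothetical counts built from integer IDs.
counts = Counter([867000, 867000, 42])

int_lookup = counts[867000]    # integer key that was counted twice
str_lookup = counts["867000"]  # string key that was never counted, so the result is 0

print(int_lookup)
print(str_lookup)
```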

In [20]:
# id_counts