# WPC-Social-Media-Presence STARTER KIT

This part of the WPC Starter Kit covers how to find social media links on enterprise websites.

Input: either a list of URLs or scraped data in a MongoDB database as returned by [DomainScraper](https://github.com/EnterpriseCharacteristicsESSnetBigData/StarterKit/tree/master/URLScraper)

Output: A csv file containing domain names and found social media links


## 0. Prerequisites - how to set up the Python environment


Python 3 is used for this library. We recommend installing Python with the Anaconda distribution, which can be obtained [here](https://www.anaconda.com/distribution/).


Use Python 3 only; the library does not work on Python 2.


The following libraries need to be installed:
<ul>
    <li>bs4</li> 
    <li>requests (part of Anaconda)</li>
</ul>
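
A quick way to confirm the environment before continuing is sketched below. It only checks the interpreter version and whether the two packages can be located. The helper name `missing_packages` is made up for this illustration and is not part of the starter kit.

```python
import sys
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


# The library requires Python 3.
if sys.version_info.major < 3:
    raise RuntimeError("Python 3 is required")

missing = missing_packages(["bs4", "requests"])
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required packages were found.")
```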



## 1. Importing the library

The file SocialMediaPresenceCollector.py is located in the src folder.


In [1]:
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

import SocialMediaPresenceCollector as smpc

## 2. Initiating the class SocialMediaPresence
The class SocialMediaPresence has one parameter: a dictionary of social media platforms with lists of their domain URLs as values. By default, it will collect social media links to Facebook, Twitter, Youtube, LinkedIn, Instagram, Xing and Pinterest. We will use this default in the starter kit.

In [2]:
smp=smpc.SocialMediaPresence()

The following cell shows the default dictionary of social media domains. Other platforms can be included by providing a different dictionary.

In [3]:
smp.social_media_dict

{'Facebook': ['facebook.com'],
 'Twitter': ['twitter.com'],
 'Youtube': ['youtu.be', 'youtube.com'],
 'LinkedIn': ['linkedin.com'],
 'Instagram': ['instagram.com'],
 'Xing': ['xing.com'],
 'Pinterest': ['pinterest.com']}
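
To illustrate how such a dictionary can be used, the sketch below maps a link to a platform by comparing the link's host name with the listed domains. This is an assumption about the general idea, not the code in SocialMediaPresenceCollector. The names `classify_link` and `custom_dict` are hypothetical, and the extra `Vimeo` entry is only an example of adding a platform.

```python
from urllib.parse import urlparse

# Hypothetical custom dictionary in the same shape as the default one.
custom_dict = {
    "Facebook": ["facebook.com"],
    "Youtube": ["youtu.be", "youtube.com"],
    "Vimeo": ["vimeo.com"],
}


def classify_link(link, platforms):
    """Return the platform whose domain matches the link's host, or None."""
    host = (urlparse(link).hostname or "").lower()
    for platform, domains in platforms.items():
        for domain in domains:
            # Match the domain itself or any subdomain such as www. or pl.
            if host == domain or host.endswith("." + domain):
                return platform
    return None


print(classify_link("https://www.youtube.com/channel/abc", custom_dict))
print(classify_link("https://example.org/contact", custom_dict))
```

If the class accepts the dictionary as its single parameter, as described above, a dictionary of this shape would presumably be passed when creating the object.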

## 3. Option 1: Providing a list of URLs as input


### 3.1 Reading the URL data
The data source containing the URLs can be any iterable, such as a list or a data frame column containing enterprise URLs. For this starter kit, we use an input file with one URL per line that looks like this:

maslankowski.pl<br/>
http://stat.gov.pl<br/>
www.ug.edu.pl


In [4]:
data_file = 'url.txt'

In [5]:
# reading in the input file: one URL per line, skipping empty lines
with open(data_file, "r") as f:
    url_data = [line.strip() for line in f if line.strip()]
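
Entries in such a file may or may not carry a scheme or a `www.` prefix, as the three sample lines show. The sketch below is one possible way to tidy the lines before scraping. Whether the library does something similar internally is not stated here, and `normalize_url` is a hypothetical helper.

```python
def normalize_url(line, default_scheme="http"):
    """Strip whitespace and prepend a scheme when the line has none."""
    url = line.strip()
    if not url:
        return None
    if "://" not in url:
        url = default_scheme + "://" + url
    return url


# Lines as they would be read from the sample input file, plus an empty line.
sample_lines = ["maslankowski.pl\n", "http://stat.gov.pl\n", "www.ug.edu.pl\n", "\n"]
urls = [u for u in (normalize_url(line) for line in sample_lines) if u]
print(urls)
```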

### 3.2 Scraping URLs and finding Social Media Links
Before scraping URLs and finding social media links, a FileAccess object has to be instantiated. It will save the found links in both a JSON file and a CSV file.

In [6]:
# Initiating the class FileAccess
fa = smpc.FileAccess()
# Results will be stored in a list
jsonList = []

We loop through the URLs from our input file. Each URL is first scraped with the requests library; the searchSocialMediaLinks method of SocialMediaPresence then identifies and saves the social media links.

In [7]:
for url in url_data:
    print("\nWebsite currently being scraped:", url)
    jsonList.append(smp.searchSocialMediaLinks(url))


Website currently being scraped: stat.gov.pl

The length of the scrapped content: 90079 characters
Number of links on website: 255
https://www.youtube.com/channel/UC0wiQMElFgYszpAoYgTnXtg/featured
https://www.facebook.com/GlownyUrzadStatystyczny/
http://twitter.com/GUS_STAT
https://www.linkedin.com/company/532930
https://www.instagram.com/gus_stat/
https://twitter.com/GUS_STAT/lists/gus-i-urz-dy-statystyczne?ref_src=twsrc%5Etfw
Total number of unique social media links found: 6

Website currently being scraped: http://ug.edu.pl

The length of the scrapped content: 60769 characters
Number of links on website: 137
https://www.facebook.com/UniwersytetGdanski
https://twitter.com/uniwersytet_gd
https://www.instagram.com/uniwersytet_gdanski/
https://www.youtube.com/channel/UCOrHv73IWNIetJveGjV_zLA
https://pl.linkedin.com/school/uniwersytet-gda%C5%84ski/
Total number of unique social media links found: 5

Website currently being scraped: maslankowski.pl

The length of the scrapped content: 4

In [13]:
fa.jsonListWrite(jsonList)
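
The link-finding step in the loop above can be pictured with the following sketch. It is not the implementation of searchSocialMediaLinks. To stay self-contained it uses `html.parser` from the standard library instead of requests and bs4, and it works on a small sample HTML string rather than a downloaded page. `LinkCollector` and `find_social_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_social_links(html, platforms):
    """Return {platform: [unique links]} for links whose host matches a listed domain."""
    collector = LinkCollector()
    collector.feed(html)
    found = {platform: [] for platform in platforms}
    for link in collector.links:
        host = (urlparse(link).hostname or "").lower()
        for platform, domains in platforms.items():
            if any(host == d or host.endswith("." + d) for d in domains):
                if link not in found[platform]:
                    found[platform].append(link)
    return found


# Sample page content for illustration only.
sample_html = """
<a href="https://www.facebook.com/GlownyUrzadStatystyczny/">Facebook</a>
<a href="http://twitter.com/GUS_STAT">Twitter</a>
<a href="/kontakt">Contact</a>
<a href="http://twitter.com/GUS_STAT">Twitter again</a>
"""
platforms = {"Facebook": ["facebook.com"], "Twitter": ["twitter.com"]}
print(find_social_links(sample_html, platforms))
```

Relative links such as `/kontakt` have no host name and are therefore ignored, and the duplicate Twitter link is kept only once.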


### 3.3 Output files

The application produces two output files:
<b>wpc_social.csv</b>
and
<b>wpc_social_YYYYMMDDHHMMSSnnnnnnn.json</b>
<br/><br/>
The file <b>wpc_social.csv</b> is updated with the new results on every run.
<br/><br/>
A new json file is created every time the application runs.
<br/><br/>
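
The difference between the two files can be sketched as follows. This is an illustration under assumptions, not the FileAccess code. It assumes that "updated" means rows are appended to the CSV, that the CSV has one row per link (URL, platform, link), and that the timestamp suffix can be approximated with `%Y%m%d%H%M%S%f`. `write_results` is a hypothetical helper, and the result dictionary follows the shape printed by the library.

```python
import csv
import json
import os
import tempfile
from datetime import datetime


def write_results(results, directory):
    """Append one CSV row per link and write all results to a new timestamped JSON file."""
    csv_path = os.path.join(directory, "wpc_social.csv")
    stamp = datetime.now().strftime("%Y%m%d%H%M%S%f")  # assumed timestamp format
    json_path = os.path.join(directory, "wpc_social_" + stamp + ".json")

    # Mode "a" keeps earlier rows, so the CSV grows with every run.
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for result in results:
            for platform, links in result.items():
                if platform == "URL":
                    continue
                for link in links:
                    writer.writerow([result["URL"], platform, link])

    # A fresh JSON file is created on each call.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    return csv_path, json_path


results = [{"URL": "stat.gov.pl",
            "Facebook": ["https://www.facebook.com/GlownyUrzadStatystyczny/"],
            "Xing": []}]
with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = write_results(results, tmp)
    write_results(results, tmp)  # a second run appends to the same CSV
    with open(csv_path, encoding="utf-8") as f:
        print(f.read())
```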


## 4. Option 2: Using scraped data obtained by DomainScraper as input
In case you want to scrape enterprise websites once and then derive different characteristics from it, it makes sense to first scrape and save the data with DomainScraper. Your MongoDB database can serve as data source for the SocialMediaPresenceCollector.

This part only works if you already created a MongoDB database with web scraped enterprise websites.

### 4.1 Create a connection to the MongoDB server

In [8]:
from pymongo import MongoClient

In [9]:
host='localhost'
port=27017
# define the client connection
# host - default localhost
# port - default 27017
client=MongoClient('mongodb://'+str(host)+":"+str(port))

In [10]:
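dbname='URLScraping' # Change to the name of your MongoDB database with scraped data
try:
    # MongoClient connects lazily, so a ping is used to check that the server is reachable
    client.admin.command('ping')
    database=client[dbname]
except Exception:
    print('Error connecting the database', sys.exc_info()[0])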

In [11]:
collectionName = database.websites # Change to your collection name

### 4.2 Use scraped data to find social media links
Results are saved in the file 'wp2_social.csv' within your working directory.

Optional: You can specify a timespan in searchSocialMediaLinksNoSQL so that only data that was scraped within that time will be used by the SocialMediaPresenceCollector. 

Default: Starting date: '2020-01-01'; End date: today
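
The effect of the timespan can be pictured with the sketch below, which filters a few in-memory records using the same defaults (start '2020-01-01', end now). It does not show how searchSocialMediaLinksNoSQL queries MongoDB. `in_timespan`, the field names `url` and `scraped_at`, and the second record are hypothetical.

```python
from datetime import datetime


def in_timespan(scraped_at, date_from="2020-01-01", date_until=None):
    """Return True if `scraped_at` lies between `date_from` and `date_until` (inclusive)."""
    start = datetime.strptime(date_from, "%Y-%m-%d")
    end = date_until if date_until is not None else datetime.now()
    return start <= scraped_at <= end


# Hypothetical records; the real field names depend on the DomainScraper schema.
records = [
    {"url": "stat.gov.pl", "scraped_at": datetime(2020, 9, 16, 16, 53, 9)},
    {"url": "example.org", "scraped_at": datetime(2019, 5, 1, 12, 0, 0)},
]
selected = [r["url"] for r in records if in_timespan(r["scraped_at"])]
print(selected)
```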

In [12]:
from datetime import datetime
smp.searchSocialMediaLinksNoSQL(collectionName, dateFrom='2020-01-01', dateUntil=datetime.now())

stat.gov.pl 2020-09-16 16:53:09.260800
The length of the scrapped content: 89746 characters
Number of links on website: 255
https://www.youtube.com/channel/UC0wiQMElFgYszpAoYgTnXtg/featured
https://www.facebook.com/GlownyUrzadStatystyczny/
http://twitter.com/GUS_STAT
https://www.linkedin.com/company/532930
https://www.instagram.com/gus_stat/
https://twitter.com/GUS_STAT/lists/gus-i-urz-dy-statystyczne?ref_src=twsrc%5Etfw
Total number of unique social media links found: 6
{'URL': 'stat.gov.pl', 'Facebook': ['https://www.facebook.com/GlownyUrzadStatystyczny/'], 'Twitter': ['http://twitter.com/GUS_STAT', 'https://twitter.com/GUS_STAT/lists/gus-i-urz-dy-statystyczne?ref_src=twsrc%5Etfw'], 'Youtube': ['https://www.youtube.com/channel/UC0wiQMElFgYszpAoYgTnXtg/featured'], 'LinkedIn': ['https://www.linkedin.com/company/532930'], 'Instagram': ['https://www.instagram.com/gus_stat/'], 'Xing': [], 'Pinterest': []}
destatis.de 2020-09-16 16:53:09.695800
The length of the scrapped content: 83069 cha