# WPC-Social-Media-Presence STARTER KIT

This part of the WPC Starter Kit shows how to find social media links on enterprise websites. It takes a list of URLs as input, for example URLs found with this [URL finder](https://github.com/EnterpriseCharacteristicsESSnetBigData/StarterKit/blob/master/URLsFinder/URLs_Finder_Starte_Kit.ipynb), and identifies and saves the social media links found on those websites.


### 0. Prerequisites - how to set up the Python environment


Python 3 is used for this library. We recommend installing Python with the Anaconda distribution, which can be obtained [here](https://www.anaconda.com/distribution/).


Use Python 3 only: the library does not work on Python 2.


The following libraries need to be installed:
<ul>
    <li>bs4</li> 
    <li>requests (part of Anaconda)</li>
</ul>
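
Before running the kit, it can help to confirm that the environment meets these prerequisites. The following is a minimal sketch, not part of the starter kit itself: the helper name `missing_packages` is hypothetical, and it only checks whether the packages can be found, not which versions are installed.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The library requires Python 3.
if sys.version_info < (3,):
    raise SystemExit("Python 3 is required.")

# bs4 and requests are the two libraries listed in the prerequisites.
missing = missing_packages(["bs4", "requests"])
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required libraries were found.")
```

If `bs4` is reported as missing, note that it is distributed under the package name `beautifulsoup4`.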



### 1. Importing the library

The file SocialMediaPresenceCollector.py is located in the src folder.


In [2]:
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

import SocialMediaPresenceCollector as smpc

### 2. Input file

The data source containing the URLs can be an iterable such as a list, or a data frame column containing enterprise URLs. For this starter kit, we use a line-separated input file that looks like this:

maslankowski.pl<br/>
http://stat.gov.pl<br/>
www.ug.edu.pl


In [3]:
data_file = 'url.txt'

In [4]:
# reading in the input file (one URL per line); the file is closed automatically
with open(data_file, "r") as f:
    url_data = f.readlines()
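
Lines read from a text file keep their trailing newline, and the sample input mixes bare domains with full URLs. The starter kit passes the raw lines straight to the collector, so the following step is optional. It is a minimal sketch of how such input could be tidied beforehand. The helper `clean_urls` and its `default_scheme` parameter are hypothetical and not part of the library.

```python
def clean_urls(lines, default_scheme="http://"):
    """Strip whitespace, skip empty lines, add a scheme where missing, drop duplicates."""
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        if not url.lower().startswith(("http://", "https://")):
            url = default_scheme + url  # assumption: plain http is an acceptable default
        if url not in cleaned:
            cleaned.append(url)
    return cleaned


# Sample lines in the same shape as the input file shown above.
sample = ["maslankowski.pl\n", "http://stat.gov.pl\n", "\n", "www.ug.edu.pl"]
print(clean_urls(sample))
```

The same function works on any iterable of strings, so it could also be applied to a data frame column.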

### 3. Initiating the class SocialMediaPresence
The class SocialMediaPresence has one parameter: a dictionary of social media platforms with lists of their domain URLs as values. By default, it will collect social media links to Facebook, Twitter, Youtube, LinkedIn, Instagram, Xing and Pinterest. We will use this default in the starter kit.

In [5]:
smp=smpc.SocialMediaPresence()

This shows the default dictionary of social media domains. Other platforms can be included by passing a custom dictionary when the class is instantiated.

In [6]:
smp.social_media_dict

{'Facebook': ['facebook.com'],
 'Twitter': ['twitter.com'],
 'Youtube': ['youtu.be', 'youtube.com'],
 'LinkedIn': ['linkedin.com'],
 'Instagram': ['instagram.com'],
 'Xing': ['xing.com'],
 'Pinterest': ['pinterest.com']}
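
To collect links for other platforms, a dictionary with the same shape can be built and passed to the class. The sketch below copies the default shown above and adds one entry. The TikTok entry is purely illustrative, and the commented-out call assumes that the dictionary can be passed as the first positional argument, which this document does not confirm.

```python
# Default dictionary as displayed above.
default_dict = {
    'Facebook': ['facebook.com'],
    'Twitter': ['twitter.com'],
    'Youtube': ['youtu.be', 'youtube.com'],
    'LinkedIn': ['linkedin.com'],
    'Instagram': ['instagram.com'],
    'Xing': ['xing.com'],
    'Pinterest': ['pinterest.com'],
}

# Copy the default so that the original is left untouched, then extend it.
custom_dict = {name: list(domains) for name, domains in default_dict.items()}
custom_dict['TikTok'] = ['tiktok.com']  # hypothetical extra platform

# Assumed usage (argument position not confirmed by the source):
# smp = smpc.SocialMediaPresence(custom_dict)
print(sorted(custom_dict))
```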

### 4. Scraping URLs and finding Social Media Links
Before scraping URLs and finding social media links, a FileAccess object has to be instantiated. It saves the links that are found in both a JSON and a CSV file.

In [6]:
# Initiating the class FileAccess
fa = smpc.FileAccess()
# Results will be stored in a list
jsonList = []

We loop through the URLs from our input file. Each URL is first scraped with the requests library; then the searchSocialMediaLinks method of SocialMediaPresence identifies and saves the social media links.

In [7]:
for url in url_data:
    print("\nWebsite currently being scraped:", url)
    jsonList.append(smp.searchSocialMediaLinks(url))


Website currently being scraped: stat.gov.pl

Exception during scraping content of the webpage: http://stat.gov.pl

Website currently being scraped: http://ug.edu.pl

The length of the scrapped content: 60255 characters
Number of links on website: 134
https://www.facebook.com/UniwersytetGdanski
https://twitter.com/uniwersytet_gd
https://www.instagram.com/uniwersytet_gdanski/
https://www.youtube.com/channel/UCOrHv73IWNIetJveGjV_zLA
https://pl.linkedin.com/school/uniwersytet-gda%C5%84ski/
Total number of unique social media links found: 5

Website currently being scraped: maslankowski.pl

The length of the scrapped content: 4869 characters
Number of links on website: 14
https://twitter.com/jmaslankowski
https://twitter.com/jmaslankowski?ref_src=twsrc%5Etfw
Total number of unique social media links found: 2
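
The matching step behind this output can be illustrated in a few lines. The sketch below is not the library's implementation, which relies on bs4. It uses only the Python standard library to show the idea: collect every link on a page and keep those whose host belongs to one of the domains in the dictionary. The names `LinkCollector` and `find_social_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_social_links(html, social_media_dict):
    """Return {platform: [unique links]} for links whose host matches a listed domain."""
    parser = LinkCollector()
    parser.feed(html)
    found = {}
    for link in parser.links:
        host = (urlparse(link).hostname or "").lower()
        for platform, domains in social_media_dict.items():
            # Match the domain itself or any subdomain (www., pl., ...).
            if any(host == d or host.endswith("." + d) for d in domains):
                links = found.setdefault(platform, [])
                if link not in links:
                    links.append(link)
    return found


page = (
    '<a href="https://www.facebook.com/UniwersytetGdanski">Facebook</a>'
    '<a href="https://twitter.com/uniwersytet_gd">Twitter</a>'
    '<a href="/contact">Contact</a>'
)
domains = {"Facebook": ["facebook.com"], "Twitter": ["twitter.com"]}
print(find_social_links(page, domains))
```

Comparing the parsed host name rather than searching the whole URL for the domain string avoids false matches such as a link to `notfacebook.com`.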


In [8]:
fa.jsonListWrite(jsonList)


### 5. Output files

The application produces two output files:
<b>wpc_social.csv</b>
and
<b>wpc_social_YYYYMMDDHHMMSSnnnnnnn.json</b>
<br/><br/>
The file <b>wpc_social.csv</b> is updated with the results of each run.
<br/><br/>
A new JSON file is created every time the application runs.
<br/><br/>
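
The difference between the two files can be sketched as follows: one shared CSV file that grows with each run, and one timestamped JSON file per run. This is an illustration under stated assumptions, not the code of `FileAccess`. The function `write_results`, the record layout with the keys `url` and `links`, and the use of microseconds for the fractional part of the timestamp are all hypothetical.

```python
import csv
import json
import os
import tempfile
from datetime import datetime


def write_results(results, out_dir="."):
    """Write one timestamped JSON file and append the same results to a shared CSV."""
    # A new JSON file per run; the timestamp format is an assumption.
    stamp = datetime.now().strftime("%Y%m%d%H%M%S%f")
    json_path = os.path.join(out_dir, "wpc_social_" + stamp + ".json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    # A single CSV file, opened in append mode so that earlier runs are kept.
    csv_path = os.path.join(out_dir, "wpc_social.csv")
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for record in results:
            for link in record["links"]:
                writer.writerow([record["url"], link])
    return json_path, csv_path


# Demonstration in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as demo_dir:
    demo = [{"url": "maslankowski.pl", "links": ["https://twitter.com/jmaslankowski"]}]
    print(write_results(demo, demo_dir))
```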
