# Scraping Foreign Investment Data

This project aims to scrape the newsletters released by the Department for Promotion of Industry and Internal Trade (DPIIT). There are several challenges to this.

1. While the DPIIT does release CSV files of some of its data, these only report the amount of investment flowing into India from each foreign country. It has been widely reported that corporations have been using tax havens such as Mauritius, the Cayman Islands and the Netherlands to route money into India. What hasn't been widely reported is which companies are doing so.

2. The company data is released through quarterly newsletters. Unfortunately, this data is in the form of PDFs, which means that it cannot be analysed directly. We therefore need to scrape these PDFs and extract their tables into a usable format.

3. Foreign companies invest in India in 3 ways:
    - By buying shares in Indian companies
    - By giving them money directly
    - By going through the Reserve Bank of India (RBI)

Our challenge is to arrange the data to reflect these 3 ways. How would researchers want to access this data?

The aim is to scrape data for at least the last 4 years and then try to make sense of it in aggregate.

This project will use camelot, pandas and mapping software.

## First, we get all the quarters

This is the link to the homepage of the Newsletter: https://dpiit.gov.in/publications/si-news-letters

### Using BeautifulSoup, we first scrape the links to all the quarterly newsletters

In [13]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
# Fetch the page that links to every quarterly newsletter
my_url = "https://dpiit.gov.in/publications/si-news-letters"
raw_html = requests.get(my_url).content
soup_doc = BeautifulSoup(raw_html, "html.parser")

In [4]:
# Loop over every link in the table and collect its href
links = soup_doc.select(".views-table.cols-3 a")
urls = []
for link in links:
    print(link["href"])
    urls.append(link["href"])

# The hrefs are relative, so prepend the site root to get full URLs
homepage = 'https://dpiit.gov.in'
fullurls = [homepage + x for x in urls]
fullurls

/sia-newsletter/fdi-newsletter-vol-xxxi-no-2-october-2022
/sia-newsletter/fdi-newsletter-vol-xxxi-no-1-july-2022
/sia-newsletter/fdi-newsletter-vol-xxx-no-4-april-2022
/sia-newsletter/fdi-newsletter-vol-xxx-no-3-january-2022
/sia-newsletter/fdi-newsletter-vol-xxx-no-2-october-2021
/sia-newsletter/fdi-newsletter-vol-xxx-no-1-july-2021
/sia-newsletter/fdi-newsletter-vol-xxix-no-4-april-2021
/sia-newsletter/fdi-newsletter-vol-xxix-no-3-january-2021
/sia-newsletter/fdi-newsletter-vol-xxix-no-2-october-2020
/sia-newsletter/fdi-newsletter-vol-xxix-no-1-july-2020
/sia-newsletter/fdi-newsletter-vol-xxviii-no-4-april-2020
/sia-newsletter/fdi-newsletter-vol-xxviii-no-3-january-2020
/sia-newsletter/fdi-newsletter-vol-xxviii-no-2-october-2019
/sia-newsletter/fdi-newsletter-vol-xxviii-no-1-july-2019
/sia-newsletter/fdi-newsletter-vol-xxvii-no-4-april-2019
/sia-newsletter/fdi-newsletter-vol-xxvii-no-3-january-2019
/sia-newsletter/fdi-newsletter-vol-xxvii-no-2-october-2018
/sia-newsletter/fdi-newslet

['https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxxi-no-2-october-2022',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxxi-no-1-july-2022',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxx-no-4-april-2022',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxx-no-3-january-2022',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxx-no-2-october-2021',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxx-no-1-july-2021',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxix-no-4-april-2021',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxix-no-3-january-2021',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxix-no-2-october-2020',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxix-no-1-july-2020',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxviii-no-4-april-2020',
 'https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxviii-no-3-january-2020',
 'https://dpiit.gov.in/sia-news
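
Prepending the homepage works because every link in this table is relative. As a hedged alternative, the sketch below uses `urljoin` from the standard library, which also copes with links that are already absolute, and drops duplicates while keeping the original order. The helper name `to_absolute` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

HOMEPAGE = "https://dpiit.gov.in"

def to_absolute(hrefs, base=HOMEPAGE):
    """Resolve each href against `base`, keeping the first copy of any duplicate."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base, href)  # leaves absolute URLs untouched
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

With the notebook's variables this would be called as `fullurls = to_absolute(urls)`.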

## Then we scrape the newsletters within the quarter
Next, we write a loop for an individual quarter. This is the page for the newsletter released in July 2022: https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxxi-no-1-july-2022

From here, we need the last 3 links, which go directly to the PDFs:

1. Approval: https://dpiit.gov.in/sites/default/files/Table_No_19a_JUNE_22.pdf
2. Shares: https://dpiit.gov.in/sites/default/files/Table_No_19b_JUNE_22.pdf
3. RBI: https://dpiit.gov.in/sites/default/files/Table_No_19c_JUNE_22.pdf


In [5]:
qtr_url = "https://dpiit.gov.in/sia-newsletter/fdi-newsletter-vol-xxxi-no-1-july-2022"
raw_html = requests.get(qtr_url).content
doc = BeautifulSoup(raw_html, "html.parser")

In [6]:
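# The PDF links sit in the fourth <ul> from the end of the page.
# This index depends on the page layout and may need adjusting.
pdf_table = doc.select("ul")[-4]
ind_pdf = pdf_table.find_all("a")  # same as pdf_table("a")
pdfs = []
for link in ind_pdf:
    print(link['href'])
    pdfs.append(link['href'])
pdfs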

https://dpiit.gov.in/sites/default/files/Table_No_19a_JUNE_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19b_JUNE_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19c_JUNE_22.pdf


['https://dpiit.gov.in/sites/default/files/Table_No_19a_JUNE_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19b_JUNE_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19c_JUNE_22.pdf']
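
Picking the list by its position on the page is fragile. A possible alternative, sketched below, filters links by file name instead. It assumes the three files always start with `Table_No_19` and end in `.pdf`, as they do for July 2022; other quarters may be named differently, so treat this as a starting point. The helper name `table_19_pdfs` is hypothetical.

```python
from urllib.parse import urlparse

def table_19_pdfs(hrefs):
    """Keep only the hrefs whose file name looks like Table_No_19*.pdf."""
    wanted = []
    for href in hrefs:
        # Take the last path segment and compare case-insensitively
        name = urlparse(href).path.rsplit("/", 1)[-1].lower()
        if name.startswith("table_no_19") and name.endswith(".pdf"):
            wanted.append(href)
    return wanted
```

It could be fed every link on the quarter page, for example `table_19_pdfs([a["href"] for a in doc.select("a[href]")])`.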

## MASTER LOOP

Combine the two loops to get all the remittance-wise PDFs!

In [7]:
pdfs = []
for url in fullurls:
    
    # Step 1 : get the page
    qtr_url = url
    raw_html = requests.get(qtr_url).content
    doc = BeautifulSoup(raw_html, "html.parser")

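    # Step 2: Get the pdf links (assumes every quarter page has the same layout,
    # with the PDF list in the fourth <ul> from the end)
    pdf_table = doc.select("ul")[-4]
    ind_pdf = pdf_table("a")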
    
    for link in ind_pdf:
        print(link['href'])
        pdfs.append(link['href'])
pdfs

https://dpiit.gov.in/sites/default/files/Table_No_19a_SEPT_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19b_SEPT_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19c_SEPT_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19a_JUNE_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19b_JUNE_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19c_JUNE_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19a_MAR_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19b_MAR_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19c_MAR_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19a_DEC_21.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19b_DEC_21.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19c_DEC_21.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19a_SEPT_21_0.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19b_SEPT_21.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19c_SEPT_21.pdf
https://dpiit.

['https://dpiit.gov.in/sites/default/files/Table_No_19a_SEPT_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19b_SEPT_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19c_SEPT_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19a_JUNE_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19b_JUNE_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19c_JUNE_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19a_MAR_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19b_MAR_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19c_MAR_22.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19a_DEC_21.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19b_DEC_21.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19c_DEC_21.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19a_SEPT_21_0.pdf',
 'https://dpiit.gov.in/sites/default/files/Table_No_19b_SEPT_21.pdf',
 'https://dpiit.gov.in/s
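
Since the data needs to be arranged by the 3 investment routes, it may help to label each PDF as soon as its URL is known. The sketch below reads the route and quarter out of the file name. It assumes that `19a`, `19b` and `19c` always mean Approval, Shares and RBI, as in the July 2022 links, and that the two-digit year belongs to the 2000s. The names `ROUTES` and `describe_pdf` are hypothetical.

```python
import re

# Assumption: the letter after "19" encodes the route, as in the July 2022 links
ROUTES = {"a": "approval", "b": "shares", "c": "rbi"}
PATTERN = re.compile(r"Table_No_19([abc])_([A-Z]+)_(\d{2})", re.IGNORECASE)

def describe_pdf(url):
    """Return route, month and year parsed from a PDF URL, or None if it does not match."""
    name = url.rsplit("/", 1)[-1]
    match = PATTERN.search(name)
    if match is None:
        return None
    letter, month, year = match.groups()
    return {
        "file": name,
        "route": ROUTES[letter.lower()],
        "month": month.upper(),
        "year": 2000 + int(year),  # assumption: two-digit years are 20xx
    }
```

A list of these dictionaries, such as `[describe_pdf(u) for u in pdfs]`, can be passed to `pd.DataFrame` once any `None` entries are dropped.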

I GOT LAZY
So I copied this list into a text file, `url.txt`, and downloaded every file with wget from the command line:
```cat url.txt | xargs -I {}  wget {}```

Alternatively, we can use the code below to loop through and download the files, so you don't have to sit at your computer non-stop for an hour.

In [8]:
from os.path import exists

In [9]:
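with open("url.txt", mode="w") as file:
    for url in pdfs:
        # Keep a record of every URL so the list can also be fed to wget
        file.write(url + "\n")
        print(url)
        # Use the last part of the URL as the local file name
        filename = url.split("/")[-1]
        print(filename)
        if exists(filename):
            # Skip files downloaded on an earlier run
            print("file already exists")
        else:
            response = requests.get(url)
            response.raise_for_status()  # stop rather than save an error page as a PDF
            with open(filename, mode="wb") as pdf_file:
                pdf_file.write(response.content)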


https://dpiit.gov.in/sites/default/files/Table_No_19a_SEPT_22.pdf
Table_No_19a_SEPT_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19b_SEPT_22.pdf
Table_No_19b_SEPT_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19c_SEPT_22.pdf
Table_No_19c_SEPT_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19a_JUNE_22.pdf
Table_No_19a_JUNE_22.pdf
file already exists
https://dpiit.gov.in/sites/default/files/Table_No_19b_JUNE_22.pdf
Table_No_19b_JUNE_22.pdf
file already exists
https://dpiit.gov.in/sites/default/files/Table_No_19c_JUNE_22.pdf
Table_No_19c_JUNE_22.pdf
file already exists
https://dpiit.gov.in/sites/default/files/Table_No_19a_MAR_22.pdf
Table_No_19a_MAR_22.pdf
file already exists
https://dpiit.gov.in/sites/default/files/Table_No_19b_MAR_22.pdf
Table_No_19b_MAR_22.pdf
file already exists
https://dpiit.gov.in/sites/default/files/Table_No_19c_MAR_22.pdf
Table_No_19c_MAR_22.pdf
https://dpiit.gov.in/sites/default/files/Table_No_19a_DEC_21.pdf
Table_No_19a_DEC_21.pd
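
The same skip-if-present logic can be wrapped in a function that saves into a separate folder, which keeps the PDFs out of the notebook's working directory. This is only a sketch: the name `download_missing` and the `fetch` parameter are hypothetical, and `fetch` stands for any callable that takes a URL and returns the file's bytes.

```python
from pathlib import Path
from urllib.parse import urlparse

def download_missing(urls, dest_dir, fetch):
    """Save each URL into dest_dir unless a file with that name is already there.

    Returns the list of paths written during this call.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for url in urls:
        # File name is the last segment of the URL path
        target = dest / Path(urlparse(url).path).name
        if target.exists():
            continue  # downloaded on an earlier run
        target.write_bytes(fetch(url))
        written.append(target)
    return written
```

With the notebook's `pdfs` list it could be called as `download_missing(pdfs, "pdfs", lambda u: requests.get(u, timeout=30).content)`, where the folder name `pdfs` is just an example.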