# Part 1: Collecting subreddit "India" Data

Source: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768

Importing necessary libraries

In [2]:
# lzma ships with the Python standard library, so it does not need to be installed with pip
import pandas as pd  # data manipulation
import praw  # Python Reddit API Wrapper

Authenticate ourselves by creating a Reddit instance and providing client credentials

In [None]:
# Credentials generated from the reddit developers applications page
my_client_id = '8KS6G6Nt9BU9sg'
my_client_secret = 'CkeVFda-vf0DbseDb0eEr1YMpJo'
user = 'reddit_scrape'

reddit = praw.Reddit(client_id=my_client_id, client_secret=my_client_secret, user_agent=user)
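
Hard-coding the client ID and secret in a notebook exposes them to anyone who can read the file. Below is a minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT`, and the helper `load_credentials`, are hypothetical and not part of PRAW.

```python
import os


def load_credentials(env=os.environ):
    """Read Reddit API credentials from a mapping (the process environment by default)."""
    # REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET are hypothetical variable names.
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Fall back to the user agent string used in this notebook.
        "user_agent": env.get("REDDIT_USER_AGENT", "reddit_scrape"),
    }


# Usage (requires praw and real credentials in the environment):
# reddit = praw.Reddit(**load_credentials())
```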

## Get subreddit data 
Using the `reddit` instance from the previous section, we can acquire up to 1000 of the top, hottest, or newest posts from a subreddit.

In [None]:
num_of_posts = 1000   # Number of posts we want in our data
new_posts = reddit.subreddit('India').top(limit=num_of_posts)

In [None]:
# Print the titles of the top posts in this subreddit
for post in new_posts:
    print(post.title)

Check the type of `new_posts` to identify its class and to find out more about its methods and attributes.
Answer: it is an object of the `ListingGenerator` class.

In [None]:
print(type(new_posts))
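
A `ListingGenerator` is lazy: it fetches posts as you iterate over it. Assuming it behaves like other Python iterators, it can be consumed only once, so a second loop over `new_posts` would print nothing unless the listing is requested again or saved to a list first. The sketch below illustrates this with a plain generator as a stand-in; `fake_listing` is hypothetical and not a PRAW function.

```python
def fake_listing(titles):
    """Stand-in for a lazy listing: yields one title at a time."""
    for title in titles:
        yield title


listing = fake_listing(["post A", "post B", "post C"])
first_pass = list(listing)   # consumes the generator
second_pass = list(listing)  # nothing is left to yield

# Materialise the items once if they are needed more than once.
saved = list(fake_listing(["post A", "post B"]))
```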

## Structure subreddit data into a pandas dataframe

Create a list of features to save in the dataset. These are the data columns that we will extract from each submission.

**Feature Description:**

* **Title:** The title of the submission.
* **Score:** The number of upvotes for the submission.
* **ID:** ID of the submission.
* **Subreddit:** Provides an instance of Subreddit.
* **URL:** The URL the submission links to, or the permalink if a selfpost.
* **Original:** Whether or not the submission has been set as original content.
* **num_comments:** The number of comments on the submission.
* **Flair:** The text of the submission's link flair, or `None` if it has no flair.
* **Body:** The submission's selftext, an empty string if a link post.
* **created_on:** Time the submission was created, as a Unix timestamp.


In [None]:
features = ['Title', 'Score', 'ID', 'Subreddit', 'URL', 'Original', 'num_comments', 'Flair', 'Body', 'created_on']
posts = [] # List containing the data from individual reddit posts. Each item will be a new entry in the dataframe

In [None]:
india_sub = reddit.subreddit('India')

# Loop through each submission and append its fields to the list
for post in india_sub.top(limit=num_of_posts):
    posts.append([
        post.title, post.score, post.id, post.subreddit, post.url,
        post.is_original_content, post.num_comments, post.link_flair_text,
        post.selftext, post.created,
    ])

# Convert to a data frame 
posts = pd.DataFrame(posts, columns=features)
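
Building each row as a plain list relies on the order of the values matching the order of `features`. Below is a sketch of an alternative that maps each column name to an attribute name and reads the values with `getattr`. The `SimpleNamespace` object stands in for a PRAW submission, and the names `FIELD_MAP` and `to_row` are hypothetical.

```python
from types import SimpleNamespace

# Column name -> submission attribute (a subset of the features above).
FIELD_MAP = {
    "Title": "title",
    "Score": "score",
    "ID": "id",
    "num_comments": "num_comments",
    "Flair": "link_flair_text",
    "Body": "selftext",
    "created_on": "created",
}


def to_row(submission):
    """Build one row as a dict so that values cannot drift out of column order."""
    return {column: getattr(submission, attr) for column, attr in FIELD_MAP.items()}


# Made-up stand-in for a submission object.
fake_post = SimpleNamespace(
    title="Hello", score=10, id="abc123", num_comments=2,
    link_flair_text="AskIndia", selftext="", created=1577836800.0,
)
rows = [to_row(fake_post)]
# pd.DataFrame(rows) would then produce one column per key.
```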

Let's look at the data frame and analyse the data a little more. 

In [None]:
posts

### Convert the creation timestamp to a datetime with pandas

In [None]:
# Convert the Unix timestamp (in seconds) with the to_datetime function of pandas
posts['creation_date'] = pd.to_datetime(posts['created_on'], unit='s')

In [None]:
# Drop created_on column now 
posts.drop(['created_on'], axis=1, inplace=True)
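
`created_on` holds Unix timestamps, that is, seconds since 1970-01-01, which is why `unit='s'` is passed to `pd.to_datetime`. Here is a small sketch of the same two steps on made-up timestamps (not values from the scraped data):

```python
import pandas as pd

# Made-up Unix timestamps: the epoch, one day later, and 2020-01-01 00:00:00.
demo = pd.DataFrame({"created_on": [0, 86400, 1577836800]})

# Interpret the numbers as seconds since the epoch.
demo["creation_date"] = pd.to_datetime(demo["created_on"], unit="s")

# The raw column is no longer needed once the datetime column exists.
demo = demo.drop(columns=["created_on"])
```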

In [None]:
posts.head()

In [None]:
posts.shape

Even after changing `num_of_posts`, I still get only 988 posts (Reddit caps a listing at roughly 1000 items) and a very skewed dataset. Hence, I need a better scraping mechanism to get a more balanced dataset.

# Data Collection: Improved

The first step is to reduce the number of target classes for classification, which should make the models more robust and improve accuracy.

In [None]:
# value_counts() already returns the counts in descending order
posts['Flair'].value_counts()

I sorted the flairs in descending order of frequency and picked the most popular ones to avoid skewed data. I will collect posts for each of these flairs, along with their comments and some other information relevant to the analysis and the model.
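
The selection of popular flairs can also be sketched programmatically. The example below assumes a hypothetical `min_posts` threshold and runs on a made-up list of flair values rather than the scraped data; `popular_flairs` is an illustrative helper, not something used elsewhere in this notebook.

```python
from collections import Counter


def popular_flairs(flair_values, min_posts):
    """Return flairs with at least `min_posts` posts, most frequent first."""
    # Posts without a flair show up as None and are ignored.
    counts = Counter(flair for flair in flair_values if flair is not None)
    return [flair for flair, n in counts.most_common() if n >= min_posts]


# Made-up flair values for illustration.
sample = ["Politics", "Politics", "AskIndia", None, "Food", "AskIndia", "Politics"]
selected = popular_flairs(sample, min_posts=2)
```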

In [None]:
# Relevant flairs
flairs = ["AskIndia", "Non-Political", "[R]eddiquette", 
          "Photography", "Science/Technology",
          "Politics", "Business/Finance", "Policy/Economy",
          "Sports", "Food", "AMA", "Coronavirus", "CAA-NRC-NPR"]

In [None]:
# Data features that we will be collecting 
features = [
    'Title', 
    'Score', 
    'ID',
    'URL', 
    'num_comments', 
    'created_on', 
    'Body', 
    'Original',
    'Flair', 
    'Comments'
]

In [None]:
# Create a subreddit instance 
subreddit = reddit.subreddit('india')

In [None]:
posts = []
# Up to 250 search results for each flair
for flair in flairs:
    relevant_subs = subreddit.search(f"flair_name:{flair}", limit=250)

    for sub in relevant_subs:
        post = [
            str(sub.title),
            str(sub.score),
            str(sub.id),
            str(sub.url),
            sub.num_comments,
            str(sub.created),
            str(sub.selftext),
            sub.is_original_content,
            str(sub.link_flair_text),
        ]

        # Remove the "load more comments" placeholders from the comment tree
        sub.comments.replace_more(limit=0)
        comment = ''
        for top_comment in sub.comments:
            # Concatenate the bodies of all top-level comments
            comment += str(top_comment.body) + ' '

        post.append(comment)  # Add the comments as the last column
        posts.append(post)    # Add this post to the main list
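
The comment-gathering step can be looked at in isolation. The sketch below shows one way to collect all top-level comment bodies into a single space-separated string. `SimpleNamespace` objects stand in for PRAW comment objects, and `join_comments` is a hypothetical helper.

```python
from types import SimpleNamespace


def join_comments(comments):
    """Join the body of every comment into one space-separated string."""
    return " ".join(str(comment.body) for comment in comments)


# Made-up stand-ins for top-level comments.
fake_comments = [SimpleNamespace(body="First!"), SimpleNamespace(body="Nice photo")]
combined = join_comments(fake_comments)
```

Using `str.join` avoids the trailing space that repeated string concatenation leaves behind, and it returns an empty string for a post without comments.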

In [None]:
# Convert to a data frame 
posts = pd.DataFrame(posts, columns=features)

In [None]:
posts

In [None]:
# Save the more detailed data to a CSV file
posts.to_csv('data.csv')
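
By default, `to_csv` also writes the DataFrame index as an unnamed first column. Here is a sketch of reading such a file back without gaining an extra column, using an in-memory buffer and made-up rows in place of `data.csv`:

```python
import io

import pandas as pd

# Made-up rows for illustration.
demo = pd.DataFrame({"Title": ["a", "b"], "Flair": ["Food", "AMA"]})

buffer = io.StringIO()
demo.to_csv(buffer)  # the index is written as the first, unnamed column
buffer.seek(0)

# index_col=0 turns that first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
```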