# Final Project of Big Data & Automated Content Analysis
## Are Housewives More Depressed? A Text and Visual Emotion Detection Study Based on YouTube <br>
## Part 1. Data Collection
<br>
Student: Dongdong Zhu <br>
Student number: 13523171

This study explores the emotions expressed in YouTube videos about housewives, using both text analysis and visual analysis. Part 1 collects the data from YouTube; the script takes around 25 minutes to run.

## 0. Install and import packages

In [1]:
!pip install --upgrade google-api-python-client
!pip install --upgrade google-auth google-auth-oauthlib google-auth-httplib2
!pip install youtube-transcript-api

Collecting google-api-python-client
  Downloading google_api_python_client-2.131.0-py2.py3-none-any.whl (11.7 MB)
Installing collected packages: google-api-python-client
  Attempting uninstall: google-api-python-client
    Found existing installation: google-api-python-client 2.129.0
    Uninstalling google-api-python-client-2.129.0:
      Successfully uninstalled google-api-python-client-2.129.0
Successfully installed google-api-python-client-2.131.0


In [2]:
#import general packages
import argparse
import csv
import json
import nltk
import numpy as np
import os
import pandas as pd
import re
import spacy
import string
import sys

from matplotlib import pyplot as plt
from scipy.stats import f_oneway
from scipy import stats

In [3]:
#import GoogleAPI-related packages
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

from youtube_transcript_api import (YouTubeTranscriptApi,
                                    VideoUnavailable, NoTranscriptFound, TranscriptsDisabled,
                                    TooManyRequests, YouTubeRequestFailed)

## 1. Data Collection

### 1.1 Collect Videos from YouTube

In this part, the YouTube Data API v3 was used to collect the videos that form the datasets of this study. Videos were collected separately for three hashtags: #housewife, #career woman, and #vlog.<br>
The code in this part was adapted from: <br>
(1) https://developers.google.com/youtube/v3/docs/search/list <br>
(2) https://stackoverflow.com/questions/54283003/how-to-get-maxresults-for-search-from-youtube-data-api-v3

In [4]:
#define a function to use YouTubeAPIv3
scopes = ["https://www.googleapis.com/auth/youtube.readonly"]

def main():
    # Disable OAuthlib's HTTPS verification when running locally
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
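    # read the API key from an environment variable instead of hard-coding it
    # (the variable name YOUTUBE_API_KEY is an assumption)
    developer_key = os.environ["YOUTUBE_API_KEY"]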

    return googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=developer_key)

In [5]:
#define a function to search videos from YouTube and return video ids
def youtube_search(hashtag, max_results):
    youtube = main()

    search_response = youtube.search().list(
        part = "id,snippet",
        q = f"#{hashtag}",
        type = "video",
        relevanceLanguage = "en",
        maxResults = max_results,
        order = "date"
    ).execute()

    video_ids = []

    while search_response:
        for search_result in search_response.get("items", []):
            video_ids.append(search_result["id"]["videoId"])

        next_page_token = search_response.get('nextPageToken')
        if next_page_token:
            search_response = youtube.search().list(
                part = "id,snippet",
                q = f"#{hashtag}",
                type = "video",
                relevanceLanguage = "en",
                maxResults = max_results,
                order = "date",
                pageToken = next_page_token
            ).execute()
        else:
            break

    return video_ids
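
The paging loop above keeps requesting results until the response no longer carries a `nextPageToken`. As a minimal offline sketch of that pattern, the snippet below swaps the real `youtube.search().list(...).execute()` call for a hypothetical `fake_search_page` function that serves canned pages; the page contents and token names are made up for illustration.

```python
# Hypothetical canned pages standing in for YouTube search responses.
FAKE_PAGES = {
    None: {"items": [{"id": {"videoId": "aaa"}}, {"id": {"videoId": "bbb"}}],
           "nextPageToken": "p2"},
    "p2": {"items": [{"id": {"videoId": "ccc"}}],
           "nextPageToken": "p3"},
    "p3": {"items": [{"id": {"videoId": "ddd"}}]},  # last page: no token
}


def fake_search_page(page_token=None):
    """Stand-in for one search request (an assumption, not the real API)."""
    return FAKE_PAGES[page_token]


def collect_video_ids(fetch_page):
    """Follow nextPageToken until it is absent and gather all video ids."""
    video_ids = []
    page_token = None
    while True:
        response = fetch_page(page_token)
        for item in response.get("items", []):
            video_ids.append(item["id"]["videoId"])
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return video_ids


ids = collect_video_ids(fake_search_page)
print(ids)
```

Passing the page-fetching function as an argument keeps the loop testable without network access or API quota.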

In [6]:
#define a function to save the video ids to csv files
def save_videoids(hashtag, filename, max_results = 50):
    url_prefix = 'https://www.youtube.com/watch?v='
    try:
        ids = youtube_search(hashtag, max_results)
    except HttpError as e:
        print('An HTTP error %d occurred:\n%s' % (e.resp.status, e.content))
    else:
        with open(filename, "w", newline = '') as f:
            writer = csv.writer(f)
            writer.writerow(["video_id", "url"])
            for i in ids:
                writer.writerow([i, url_prefix + i])

In [7]:
#collect videos from three hashtags and save them respectively
if __name__ == "__main__":
    save_videoids('"housewife"', "video_ids_housewife.csv")
    save_videoids('"career woman"', "video_ids_career.csv")
    save_videoids('"vlog"', "video_ids_vlog.csv")

In [8]:
df_housewife = pd.read_csv("video_ids_housewife.csv")
df_housewife.shape

(600, 2)

In [9]:
df_career = pd.read_csv("video_ids_career.csv")
df_career.shape

(534, 2)

In [10]:
df_vlog = pd.read_csv("video_ids_vlog.csv")
df_vlog.shape

(600, 2)

### 1.2 Extract Transcripts from Videos

In this part, the YouTube Transcript API was used to extract the transcripts of videos from the datasets.<br>
The code in this part was adapted from: <br>
https://github.com/jdepoix/youtube-transcript-api

In [11]:
videoids_housewife = df_housewife["video_id"]
videoids_career = df_career["video_id"]
videoids_vlog = df_vlog["video_id"]

In [12]:
#define a function to use YouTube Transcript API to extract the transcripts
def get_scripts(video_ids):
    transcripts = {}
    for video_id in video_ids:
        try:
            transcript = YouTubeTranscriptApi.get_transcript(video_id)
            transcripts[video_id] = transcript
        except (NoTranscriptFound, TranscriptsDisabled, VideoUnavailable, YouTubeRequestFailed):
            transcripts[video_id] = "None"
            print(f"No transcript was found for video {video_id}")
    return transcripts

In [13]:
#define a function to save the extracted transcripts to csv files
def save_transcripts(video_ids, filename):
    transcripts = get_scripts(video_ids)

    with open(filename, "w", newline = '') as fo:
        writer = csv.writer(fo)
        writer.writerow(["video_id", "transcript"])
        for video_id, transcript in transcripts.items():
            if transcript != "None":
                transcript_text = " ".join([t["text"] for t in transcript])
            else:
                transcript_text = "None"
            writer.writerow([video_id, transcript_text])
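
The code above indexes each transcript entry with `t["text"]`, so a transcript is treated as a list of segment dictionaries that gets flattened into one string. A small sketch with made-up segments (the sample text and timings are illustrative, not taken from the dataset) shows the shape of that data and the join step; `flatten_transcript` is a hypothetical helper name.

```python
# Made-up segments in the shape the notebook indexes with t["text"].
sample_transcript = [
    {"text": "good morning", "start": 0.0, "duration": 1.5},
    {"text": "it's windy today", "start": 1.5, "duration": 2.0},
]


def flatten_transcript(transcript):
    """Join segment texts with spaces; keep the "None" sentinel unchanged."""
    if transcript == "None":
        return "None"
    return " ".join(segment["text"] for segment in transcript)


flat = flatten_transcript(sample_transcript)
missing = flatten_transcript("None")
print(flat)
print(missing)
```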

In [14]:
#save the transcripts of videos from three hashtags respectively
if __name__ == "__main__":
    save_transcripts(videoids_housewife, "transcripts_housewife.csv")
    save_transcripts(videoids_career, "transcripts_career.csv")
    save_transcripts(videoids_vlog, "transcripts_vlog.csv")

No transcript was found for video XDa9aoFhggE
No transcript was found for video kx9absNabic
No transcript was found for video X8Yc7HgOBn4
No transcript was found for video XSoDA-aQemY
No transcript was found for video KTuAQxp9IHQ
No transcript was found for video g0D2YVYprhs
No transcript was found for video JeMIY2ELKAY
No transcript was found for video IKAbTQ6YmUM
No transcript was found for video 6ZIUoy6E4rg
No transcript was found for video a_Q7I6jtPR4
No transcript was found for video AnB5pHZzqO0
No transcript was found for video lvApHlGN3ik
No transcript was found for video bOzQQgAvTU8
No transcript was found for video lJl4ANtT4dI
No transcript was found for video LPfwnn5JMMk
No transcript was found for video ZRhwRxc5yUg
No transcript was found for video _PJCs8RU5jw
No transcript was found for video ton8kUUREDI
No transcript was found for video WIEAulFa6OY
No transcript was found for video Yf8Op5lNwz0
No transcript was found for video S_yfLWBlpgg
...

No transcript was found for video wdPSCIXcNVc
No transcript was found for video SY_gBF-xRww
No transcript was found for video dM2zqjYXyjs
No transcript was found for video DlSvEjAvG4k
No transcript was found for video 9KqkSN9BGSs
No transcript was found for video ipy2Zdlfxv0
No transcript was found for video 2Z4ClKcFV48
No transcript was found for video skudZAyZCwg
No transcript was found for video E-nmPOVkE7M
No transcript was found for video uGRH8XzMsXc
No transcript was found for video WxudF-09w4o
No transcript was found for video x2jo38TK1nk
No transcript was found for video 5RaoJXlcFUI
No transcript was found for video p9Mw-M0P1Mc
No transcript was found for video 0LCpRdvEYfE
No transcript was found for video fGRToQBTnaw
No transcript was found for video UECv7RtaYX8
No transcript was found for video Dl_Y75eUZJY
No transcript was found for video U4_ghHDu-YE
No transcript was found for video ZRfRCKFjCQc
No transcript was found for video bGWd9urm8Qk
...

No transcript was found for video L4x9xXMsx14
No transcript was found for video BITE7ZHqqgw
No transcript was found for video UMndm1NSsGY
No transcript was found for video 9gkf8G6TBug
No transcript was found for video Ls3eAwQtXKI
No transcript was found for video Knl_WwOj2lQ
No transcript was found for video JarDEO1f2UU
No transcript was found for video wl4R5W-LCrI
No transcript was found for video wl4R5W-LCrI
No transcript was found for video qf9qu3UUyTQ
No transcript was found for video acluipIBGTM
No transcript was found for video fWfUACVMgSk
No transcript was found for video rDxC-TU7eJM
No transcript was found for video sIMfF0GNWhI
No transcript was found for video 2yl2m_mzfWc
No transcript was found for video m5lWxawMWqI
No transcript was found for video Yay9U1bQRnU
No transcript was found for video rULYd_T1Hlw
No transcript was found for video R68hhKQxKmk
No transcript was found for video va6SqaMLHUc
No transcript was found for video MqusT_pyI7I
...

No transcript was found for video Ns7yzcnHLwQ
No transcript was found for video FGEzcbJcQIQ
No transcript was found for video -7-b858L-20
No transcript was found for video c0V8Pxir4GE
No transcript was found for video JQmWCaQfSI4
No transcript was found for video qCnTq5_SbVc
No transcript was found for video 300g5eM4cog
No transcript was found for video ZHkJLaSsCAs
No transcript was found for video BynutBC8_7s
No transcript was found for video jx4wWH9v_EU
No transcript was found for video S5Gpz2jGEuo
No transcript was found for video CN4CbHiNQ_U
No transcript was found for video 6zufJGBd5J8
No transcript was found for video lDMzyJgZS1Q
No transcript was found for video ndWLOLdUlKo
No transcript was found for video BhDkjKa7DPc
No transcript was found for video 2kFekSpGJso
No transcript was found for video EefDYMPuTIg
No transcript was found for video lgWjxnzIUyw
No transcript was found for video qlDwr9-b5T4
No transcript was found for video W9_AqLlvt08
...

No transcript was found for video TKitR1T3lkI
No transcript was found for video ODQ1TdAhnMw
No transcript was found for video LuLsbFRUKdU
No transcript was found for video JJKDIN1x1cs
No transcript was found for video TScRfB2zrHs
No transcript was found for video NwT9bxiRpg0
No transcript was found for video GySqiTPJyc4
No transcript was found for video g_G93CzfqPQ
No transcript was found for video eumySw4k18k
No transcript was found for video 44oTO44XHVM
No transcript was found for video hxnHfQZJJrA
No transcript was found for video i-hB9H-HRpg
No transcript was found for video zlCBukYQ-o8
No transcript was found for video TqlbnEMCYG8
No transcript was found for video 7lJacJRBa0k
No transcript was found for video KoyqF4GXSBk
No transcript was found for video LPvvNcnKzaA
No transcript was found for video 8qGFU150O1I
No transcript was found for video FylDFB7kiT4
No transcript was found for video ovXnNeNMjrk
No transcript was found for video ZJGq7VcD6R0
...
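
The log above lists `wl4R5W-LCrI` twice, which suggests that the ID lists passed to `get_scripts` can contain repeats. One possible way to drop repeats while keeping the original order is sketched below; `unique_in_order` is a hypothetical helper and not part of the original notebook.

```python
def unique_in_order(video_ids):
    """Return video ids with repeats removed, keeping first occurrences."""
    # dict keys preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(video_ids))


sample_ids = ["JarDEO1f2UU", "wl4R5W-LCrI", "wl4R5W-LCrI", "qf9qu3UUyTQ"]
deduped = unique_in_order(sample_ids)
print(deduped)
```

Applying such a step before requesting transcripts would avoid fetching the same video more than once.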

In [15]:
da_housewife = pd.read_csv("transcripts_housewife.csv").dropna()
da_housewife = da_housewife[da_housewife["transcript"] != "None"]
da_housewife.shape

(258, 2)

In [16]:
da_career = pd.read_csv("transcripts_career.csv").dropna()
da_career = da_career[da_career["transcript"] != "None"]
da_career.shape

(366, 2)

In [17]:
da_vlog = pd.read_csv("transcripts_vlog.csv").dropna()
da_vlog = da_vlog[da_vlog["transcript"] != "None"]
da_vlog.shape

(196, 2)
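
The shapes printed so far can be turned into a transcript coverage rate per hashtag. The sketch below only reuses the row counts shown in this notebook (600, 534, and 600 collected videos; 258, 366, and 196 rows with a usable transcript); the variable names are illustrative.

```python
# Row counts copied from the shape outputs in this notebook.
collected = {"housewife": 600, "career woman": 534, "vlog": 600}
with_transcript = {"housewife": 258, "career woman": 366, "vlog": 196}

# Share of collected rows that still have a transcript after filtering.
coverage = {
    tag: with_transcript[tag] / collected[tag]
    for tag in collected
}

for tag, share in coverage.items():
    print(f"{tag}: {share:.1%} of collected videos have a usable transcript")
```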

In [18]:
da_housewife.head()

|    | video_id | transcript |
|---:|---|---|
| 7 | 8zR6SlxBH-E | [Music] super again [Music] for spicy |
| 8 | hDJ6IHRcNM0 | again for spicy |
| 11 | 8de_2Tbrz14 | so if I'm a soccer mom with a spare $200,000 h... |
| 13 | GjP_dtup0CI | [Music] it's the simple things it's the energy |
| 17 | zlynun6HedQ | Good morning It's windy today It's a comfortab... |
