# data_prep.ipynb

Name: Marvin Limpijankit  
Email: ml4431@columbia.edu  
  
This notebook covers the entire process of scraping the data and preprocessing it prior to analysis. It is separated into modules so that individual sections can be rerun with slightly different parameters, which makes the work easier to reproduce and extend. In my analysis, I collect 100 images for the search term "Ukraine" from each of 6 websites (3 Chinese, 3 Western); these are stored in the */data* directory of this repository.

In [4]:
# import packages
import sys

In [5]:
# import functions
sys.path.insert(0, '/Users/marvinlimpijankit/Documents/GitHub/COMS4901-spr2023/lib')

from image_scraping import google_image_scrape
from image_normalization import find_smallest_img_size, normalize_imgs

**google_image_scrape** is a function that takes (search_term, website_domain, num_of_images, output_path) as inputs and saves the top *num_of_images* image results for *search_term*, restricted to the given *website_domain*, under *output_path*. The images are saved to the path: "*output_path/[website_domain].[search_term]*"
  
An example call is shown below:

In [3]:
google_image_scrape('Ukraine', 'nytimes.com', 10, '../data')
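
As a minimal sketch of the naming convention described above, the helper below builds the folder that a given call is expected to write to. `scrape_output_dir` is a hypothetical name introduced for illustration and is not part of `image_scraping`; the actual function may construct the path differently.

```python
from pathlib import PurePosixPath


def scrape_output_dir(output_path, website_domain, search_term):
    """Hypothetical helper: folder '[website_domain].[search_term]' under output_path."""
    return str(PurePosixPath(output_path) / f"{website_domain}.{search_term}")


# Folder the example call above is expected to populate
print(scrape_output_dir('../data', 'nytimes.com', 'Ukraine'))  # ../data/nytimes.com.Ukraine
```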

*The result of running the above can be found in the 'data' folder. For readability, the images obtained by running this call for the six websites in this analysis, manually removing duplicates, and keeping only .jpg images are saved in 'data' in two folders titled 'China' and 'Western'. Each folder contains one sub-folder per news media source, with 100 images each.*

Once the raw data is ready, we parse through the images, finding the smallest dimension in both the x and y direction in order to normalize them into smaller, square images. This step is done for color analysis, as we want each image to contain the same number of pixels and it is easier to work with same-size, smaller images. 

*It is possible to run the analysis on the native sizes, but all methods would then need to additionally account for image size.*

In [4]:
# store a list of website names, along with their region
websites = [('CNN', 'US'), ('NBC', 'US'), ('NYT', 'US'), ('China_Daily', 'China'), 
            ('People\'s_Daily', 'China'), ('Xinhua_News_Agency', 'China')]

# find the smallest x, y dimensions across all images
find_smallest_img_size('../data', websites)

The minimum width across all images is: 156 pixels
The minimum height across all images is: 166 pixels


(156, 166)
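
The search for the smallest dimensions can be sketched as follows. This is an illustration under assumptions, not the code in `image_normalization`: it assumes Pillow is installed and that the images are .jpg files somewhere below the data directory, and the names `smallest_dims` and `collect_image_sizes` are hypothetical.

```python
from pathlib import Path


def smallest_dims(sizes):
    """Return (min width, min height) over an iterable of (width, height) pairs.

    The two minima are taken independently, so they may come from different images.
    """
    sizes = list(sizes)
    if not sizes:
        raise ValueError("no image sizes given")
    return min(w for w, _ in sizes), min(h for _, h in sizes)


def collect_image_sizes(root):
    """Assumed layout: .jpg files anywhere below `root`. Requires Pillow."""
    from PIL import Image  # imported lazily so smallest_dims works without Pillow

    sizes = []
    for path in sorted(Path(root).rglob("*.jpg")):
        with Image.open(path) as img:
            sizes.append(img.size)  # (width, height) in pixels
    return sizes


# Toy example with made-up sizes (not taken from the dataset)
print(smallest_dims([(640, 480), (200, 300), (400, 180)]))  # (200, 180)
```

With the real data, a call such as `smallest_dims(collect_image_sizes('../data'))` would play the role of the cell above, under the same assumptions.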

Based on the result above, we reduce each image to a square of size (128, 128), which is smaller than both minimum dimensions, and save the results in the 'output' folder, where we keep intermediate data files produced in this analysis. The *normalize_imgs* function also renames the images to img_0.jpg - img_100.jpg for each website, making them easier to locate for later qualitative analysis.

In [5]:
# normalize all the images in a 128x128 square format
normalize_imgs('../output', '../data', websites, (128, 128))

Images have been normalized to size (128, 128) and saved at path "../output/normalized(128, 128)".
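
One way to bring an image into a 128x128 square is to center-crop it to a square and then resize it. Whether *normalize_imgs* crops, stretches, or pads is not stated here, so the sketch below is an assumption rather than the function's actual behavior; `center_square_box` and `normalize_image` are hypothetical names, and Pillow is assumed to be installed.

```python
def center_square_box(width, height):
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


def normalize_image(src_path, dst_path, size=(128, 128)):
    """Center-crop to a square, resize to `size`, and save as JPEG. Requires Pillow."""
    from PIL import Image  # imported lazily so center_square_box works without Pillow

    with Image.open(src_path) as img:
        square = img.convert("RGB").crop(center_square_box(*img.size))
        square.resize(size).save(dst_path, format="JPEG")


# Crop box for a 156x166 image
print(center_square_box(156, 166))  # (0, 5, 156, 161)
```

Cropping before resizing keeps the aspect ratio of the retained region, at the cost of discarding the image borders; stretching the full image would keep every pixel but distort shapes.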


Now that we have our dataset and normalized images, data prep is concluded. We can move forward with the different parts of the analysis, which are covered in their respective .ipynb files in the doc directory.