<h1 align="center">Data Science II | Final Project</h1> 
&ensp;

<h3 align="center">Alvaro Montoya</h3> 

<h3 align="center">05/07/2022</h3> 



#### I. Introduction

With causes varying from the pandemic economic rebound to the war in Ukraine, inflation is soaring all over the world in 2022. From advanced economies to low-income nations, [most countries are suffering](https://blogs.worldbank.org/voices/return-global-inflation#:~:text=In%2015%20of%20the%2034,in%20more%20than%2020%20years.) from inflationary trends. Inflation is a key economic indicator because it relates directly to people's well-being. The current inflationary push is thus eroding most people's purchasing power. Therefore, monitoring inflation is important, especially for open economies that are price-takers, as most of the developing world is. 

Most inflation monitoring in the developing world still relies on monthly surveys of key retail sectors, whose prices are combined with a pre-defined set of product weights derived from consumption-expenditure surveys. The result is a lagged view of prices, in contrast with the possibility of tracking them in real time. With this background in mind, I set out to monitor prices in developing countries using web-scraped data from the online stores of supermarket chains. For the pilot stage of the project, I chose to download grocery price data from Walmart Central America. The retail giant's regional operation provides similar websites for Mexico, Guatemala, El Salvador, Honduras, and Nicaragua.


#### II. Background and Motivation

This document presents the first results with [data from Nicaragua](https://www.walmart.com.ni/). This country offers an ideal case study for at least two reasons: it is the second-poorest country in the Americas after Haiti, and it is governed by a dictatorial regime that has manipulated official statistics in the past. The first reason matters because Walmart offers relatively economical shopping, so I expect it to hold a sizable market share among Nicaraguans. The second reason is, in my opinion, the more important one, because there are grounds to believe the Ortega regime is [actively hiding the current price hike](https://www.laprensani.com/2022/04/05/economia/2977062-regimen-de-ortega-intenta-ocultar-el-aumento-de-precio-de-la-canasta-basica). Moreover, from personal experience I know that Walmart and its subsidiary stores are widely available across Nicaragua.

Nicaragua, the biggest country in Central America, is a small and open economy, constantly weakened by external shocks that erode the purchasing power of its poorest citizens. According to economist Nestor Avendaño, the current rise in inflation is not explained by excess demand but by [external price shocks](https://nestoravendano.wordpress.com/2022/04/10/la-inflacion-ya-es-un-problema/). His arguments include the fact that since the 2018 political crisis Nicaragua's Central Bank has followed a contractionary monetary policy, while the country's economy has been in a steep recession ever since. Whatever the mix of causes, inflation is a reality for most sectors of the economy, and it is particularly high for food prices (Figure 1). Using official Consumer Price Index (CPI) data, Figure 1 depicts prices surging at double-digit rates, with a year-over-year inflation rate of almost 14% for food prices in March 2022, the last month with available data. The official inflation estimates are published each month by Nicaragua's statistical institute, INIDE (Instituto Nacional de Información para el Desarrollo, in Spanish).

<h3 align="left">Figure 1. Nicaragua: Recent inflation trends</h3>
<img align="center" src="/../assets/fig1.png" width="600">

Interestingly, INIDE also publishes reports on the Basic Consumption Basket (BCB, Canasta básica in Spanish), which contains [53 consumption goods](https://www.inide.gob.ni/Home/canasta) considered essential for the monthly living expenses of a typical family of four. In Nicaragua, the evolution of the BCB is widely followed by the population and extensively covered by the few free media outlets left in the country. The [firing of the Economics Minister in 2020](https://forbescentroamerica.com/2020/04/13/el-ministro-de-economia-de-nicaragua-no-sabe-el-precio-de-la-canasta-basica/), because he did not know the prices within the BCB at that moment, is a testament to this relevance. For this project, I chose to replicate Nicaragua's BCB using representative goods that are equivalent to those included in the official BCB and are also found in the scraped data.

Figure 2 breaks down the composition of this official consumption basket, with the relative importance of each item given by its share of the monthly cost for March 2022. During March the average cost of the 53 items within Nicaragua's BCB rose to 16,998 Córdobas, approximately 475 US dollars. Figure 2 shows that food items were the main component of the total basket value in March: the cost of acquiring them corresponds to 70% of the total BCB. Products like tortillas, milk, and beef were the main representatives of the typical Nicaraguan diet as depicted by the BCB.

<h3 align="center">Figure 2. Nicaragua: Official monthly consumption basket breakdown for March 2022</h3>
<img src="/../assets/fig2.png" width="600" align="center">

It is worth noting that while prices vary month to month, the 'consensus' quantity of consumption for each product is determined by INIDE and fixed (it does not change over time). To cite a few examples, for this typical family of four (two adults and two children), INIDE calculates a monthly consumption need of 30 liters of milk, 9 pounds of cheese, and 7 dozen eggs. These quantities are based on caloric consumption needs calculated in turn by [Central America's Institute of Nutrition](http://incap.int/index.php/es/). This leads to one of the project's main premises: by combining Walmart's prices with INIDE's item quantities, it is potentially possible to recreate Nicaragua's BCB in real time. In this vein, the next section introduces the data collection method, its functionalities, and its main code components.


#### III. Data and Methods

The data for replicating Nicaragua's BCB relies on regularly updated, web-scraped price information from Walmart.com.ni. In particular, I used a Selenium Chrome driver to scrape each country's site on a weekly basis and save the results as .csv files. The web scraping code crawls through the store items and downloads their name, price, weight, and discount information. Using these indicators, a further Natural Language Processing (NLP) step helped me generate harmonized variables and build a time series from the downloaded datasets. The code snippet further below downloads price data for articles in 6 broad categories (Staples, Meats, Dairy, Personal Goods, Pharmacy, and Household Articles). I started writing the code in January-February 2022, and it has since evolved to include more categories and countries (I started with two countries and three categories). For Nicaragua, each weekly download yields on average over 1,500 individual items.

I chose Selenium over other libraries because it handles asynchronous websites well. The web scraping program works as follows: after defining a list of product categories and countries, it loads each country's website, runs through the first 50 pages of the best-selling items (a popularity measure defined by Walmart), and scrolls to the end of each page so that the data loads asynchronously (lazy loading). This is necessary because these websites, while using numbered pages, load each page's items progressively as the user scrolls (as on Instagram and similar social media sites). Therefore, the crawler needs to scroll down Walmart's page to trigger the asynchronous AJAX requests. To avoid server rejections or excessive traffic, I added timers so the program sleeps between loading steps. All scraped results are then appended to the initially empty list 'texto'. The repository for the code is [here](https://github.com/AlvaroAltamiranoM/Walmart_Consumer_Price_Index), and this [video](https://www.youtube.com/watch?v=LPIkCh4hQmQ) is a showcase of the scraping snippet at work. The repository includes a folder of scripts ('program'), one for the figures saved as .png files ('assets'), and another for the datasets used ('data').

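<pre><code>import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# XPath shared by the product cards on each listing page
ITEMS_XPATH = '/html/body/div[2]/div/div[1]/div/div[5]/div/div/section/div[2]/div/div/div/div[4]/div/div[2]/div/div[6]/div/div/div/div[2]/div'

driver = webdriver.Chrome()  # assumption: a Chrome driver is available on this machine

texto = []
countries = ['ni', 'hn', 'sv', 'gt', 'mx']
section = ['abarrotes', 'carnes-y-pescados', 'lacteos', 'higiene-y-belleza', 'farmacia', 'articulos-para-el-hogar']
for cc in countries:
    for pages in range(1, 51):  # first 50 pages of each category
        for sec in section:
            URL = 'https://www.walmart.com.{0}/{1}?order=OrderByTopSaleDESC&page={2}'.format(cc, sec, pages)
            print(URL)
            driver.get(URL)
            time.sleep(5)
            try:
                # Scroll to the bottom so the lazy-loaded items are requested
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(10)
                # Wait until the 24th product card of the page is present
                element_present = EC.presence_of_element_located((By.XPATH, ITEMS_XPATH + '[24]'))
                WebDriverWait(driver, 30).until(element_present)
                results = [i.text for i in driver.find_elements(By.XPATH, ITEMS_XPATH)]
                results.append(sec)  # tag the scraped items with their category
                print(results)
                texto.append(results)
                time.sleep(1)
            except TimeoutException:
                # The 24th card never appeared (e.g. past the last page):
                # skip the remaining sections for this page number
                break
</code></pre>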
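
The crawler above scrolls once and then waits a fixed ten seconds. A common alternative for lazy-loaded pages is to keep scrolling until the page height stops growing. The sketch below illustrates that pattern; it is not part of the project's code. The names `scroll_until_stable`, `pause`, and `max_rounds` are hypothetical, and the function only assumes an object that exposes Selenium's `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops changing.

    `driver` is assumed to behave like a Selenium WebDriver, i.e. it has an
    `execute_script` method. Returns the final page height.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the page is complete
        last_height = new_height
    return last_height
```

Capping the loop with `max_rounds` keeps the crawler from scrolling forever on a page that never stops growing, and `pause` plays the same politeness role as the sleep timers in the original loop.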
    
Once the information was saved in relational database format, I obtained the BCB's consensus quantities for each product from INIDE's website. To identify which of the thousands of items downloaded each time to include in the index, I developed different measures of text similarity as a first-tier filter. Specifically, I experimented with the Difflib, FuzzyWuzzy, and RecordLinkage libraries for string matching and text similarity. These libraries, along with a simpler keyword search using pandas' string-matching functions (e.g. pandas.Series.str.contains), helped me identify 21 official BCB items that recur in the web-scraped database. In the final stage of the NLP, I chose the 21 items (fewer for the first months of script build-up, e.g. January/February) by manually comparing the price and weight information of about 240 items. These 240 were the output of the matching libraries and NLP heuristics mentioned above. Part of the final selection was based on my prior knowledge, as a consumer, of the common brands and product variants offered by Walmart in Nicaragua. For the most part these are food items, but they also include personal hygiene and household articles. The next section introduces statistics about the price evolution of these BCB items.
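
As a rough illustration of the first-tier filter described above, the sketch below combines a keyword check with a `difflib` similarity score. It is a minimal example under stated assumptions, not the project's actual matching code: the helper names `normalize` and `candidate_matches`, the 0.6 cutoff, and the product names are all hypothetical, and only the standard-library `difflib` is used here rather than FuzzyWuzzy or RecordLinkage.

```python
import difflib
import unicodedata


def normalize(text):
    """Lowercase and strip accents so 'Azúcar' and 'azucar' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidate_matches(bcb_item, scraped_names, cutoff=0.6):
    """Return scraped names that contain the BCB keyword or closely resemble it."""
    target = normalize(bcb_item)
    hits = []
    for name in scraped_names:
        clean = normalize(name)
        # Best similarity between the keyword and any single word of the name
        score = max(
            (difflib.SequenceMatcher(None, target, word).ratio()
             for word in clean.split()),
            default=0.0,
        )
        if target in clean or score >= cutoff:
            hits.append((name, round(score, 2)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical product names, for illustration only
scraped = [
    "Arroz Faisán 80/20 2000 g",
    "Azúcar Sulfitada 2 kg",
    "Frijol Rojo 1800 g",
    "Jabón de baño 3 pack",
]
matches = candidate_matches("frijoles", scraped)
```

A filter like this only narrows the field. As in the text, the surviving candidates would still need a manual check of price and weight before entering the basket.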


#### IV. Main Findings

Initially, the main objective for gathering this data was to track consumer prices in real or quasi-real time, and to match price movements under the official CPI or BCB methodologies. At this initial stage, when the time series does not yet cover a long time span, I focus on replicating Nicaragua's BCB instead of Nicaragua's CPI. This is because the BCB methodology is more parsimonious and, as the results in this section suggest, the BCB is more clearly represented in the products available at Walmart stores. The 21 articles discussed above accounted for 87% of the food component of Nicaragua's official BCB cost for March 2022 (60% of the total basket value for that month). In quantity (weight) they also represent about 60% of the overall consumption consensus.

Figure 3 is a stacked bar plot illustrating the cost of buying the minimum quantities of each of the BCB goods identified in the scraped files. Each value results from multiplying INIDE's consensus quantity by the average monthly price of the product on Walmart's site. As time passes from January to April, each column contains more articles. This reflects the iteration of the code during the first months of the program's creation, as I mentioned earlier. I have considered the code to be in beta stage since April, because by then I had identified all the available Walmart categories in which to nest each item. Figure 3 shows that in April these 21 products added up to 10,131 Córdobas (around US$ 280). As a reference, the same 21 articles summed to C$ 10,239 in INIDE's March 2022 BCB.
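
The calculation behind each bar can be sketched in a few lines. The example below is illustrative only: `monthly_basket_cost` is a hypothetical helper, the item labels and prices are made up, and only the quantities (30 liters of milk, 7 dozen eggs) come from INIDE's consensus figures quoted earlier. It uses plain Python instead of pandas to stay dependency-free.

```python
from collections import defaultdict
from statistics import mean


def monthly_basket_cost(price_records, quantities):
    """Cost per month of buying fixed quantities at the mean observed price.

    price_records: iterable of (month, item, unit_price) tuples.
    quantities: dict mapping item -> fixed monthly quantity, expressed in
                the same unit as the price.
    Returns {month: {item: cost}}.
    """
    observed = defaultdict(list)
    for month, item, unit_price in price_records:
        if item in quantities:  # ignore products outside the basket
            observed[(month, item)].append(unit_price)

    costs = defaultdict(dict)
    for (month, item), prices in observed.items():
        costs[month][item] = quantities[item] * mean(prices)
    return dict(costs)


# Quantities from INIDE's consensus; prices are invented for illustration
quantities = {"milk_liter": 30, "eggs_dozen": 7}
records = [
    ("2022-04", "milk_liter", 30.0),   # first weekly scrape
    ("2022-04", "milk_liter", 32.0),   # second weekly scrape
    ("2022-04", "eggs_dozen", 60.0),
    ("2022-04", "soda_2l", 45.0),      # not a basket item, ignored
]
costs = monthly_basket_cost(records, quantities)
april_total = sum(costs["2022-04"].values())
```

An item with no observed price in a given month simply contributes nothing to that month's total, which is consistent with the shorter columns in the early months of Figure 3.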

The gap between the two totals can be explained by the fact that INIDE's price surveys cover a broader set of sources than mine, which has Walmart as its sole source. This determines which products I find, as well as their prices and packaging, all factors that affect each product's price trend and its comparability with the official items inspected regularly by INIDE. The gap, however, is very small, equivalent to 3 US dollars. Comparing both sets of price data (INIDE's March 2022 with my web-scraped data for April 2022) results in an implied monthly deflation of 1.7% for April 2022. This apparent contradiction, since we know officially that inflation is climbing in Nicaragua, is again a result of collecting price data from a single retailer, one that is known internationally as an economical store. I hope to continue downloading data for the rest of 2022 to confirm current inflation patterns, and this exercise will hopefully help me find other explanations for the trend. Importantly, since I cannot gather price information for cooking gas and transportation from Walmart, I cannot assess their contribution to the BCB inflation estimate for April 2022 (I suspect these items have risen significantly since the war in Ukraine started).

<h3 align="left">Figure 3. Monthly cost of consumption goods found at Walmart.com.ni </h3>
<img src="/../assets/fig3.png" width="600" align = "center">

A final *curiosity* calculation produces a simple top 15 ranking of items according to their popularity (most sold). This ranking of best-selling Walmart-Nicaragua articles is presented in Figure 4. The figure corroborates my expectations (priors) as a Nicaraguan about which items are most regularly bought at Walmart stores: corn flour to make tortillas, rice, beans, cookies, sugar, etc. These rankings could be used for prioritizing social and nutrition policy, given the popularity of Walmart itself. More interesting than the rankings, however, is the ability to extract consumer price data from Walmart sites for several countries. Given the prevalence and popularity of the retail giant in these countries, it may serve as a useful, albeit imperfect, source of information on consumer well-being. More on this in the final section.

<h3 align="left">Figure 4. Walmart-Nicaragua: Top Selling Grocery Items (sorted by unit price) </h3>
<img src="/../assets/fig4.png" width="600" align = "center">


#### V. Discussion and Limitations

In this project, I built a Selenium web crawler to extract price information from the websites of retail stores in Mexico, Guatemala, Honduras, El Salvador, and Nicaragua. The ultimate aim of this project was to assess whether data from retail stores can be used to create measures of consumer well-being in low- and middle-income countries. This document presented the overall background, approach, and results for one country, Nicaragua. The results for Nicaragua indicate that, despite limitations, it is possible to recreate official consumer statistics based on information extracted programmatically from online retail stores. I believe this type of project is scalable, easily replicable, and, most importantly, provides a good source of consumer prices for countries where authoritarian regimes may be misrepresenting or hiding inflation statistics.

In the future, I plan to continue populating the series with bimonthly downloads of new items and prices for the five countries. This time series of foods and beverages will be the basis for projecting BCB costs, and also for creating a Walmart CPI for Nicaragua. I will obtain the index weight for each product from Nicaragua's CPI methodology; if not to make the series comparable, at least to follow the same end-result coefficients, taken from the income-expenditure household surveys that helped build it. By the end of 2022, I hope to have a long enough series to do some time series decomposition analysis in Python. I also plan to complement this information with research to assess the external validity of my main results, focusing on understanding Walmart's approximate market shares. This could potentially be addressed by including other retail stores in the project.

