# DS 2023: Communicating with Data
## Final Project, Notebook A: Data Acquisition
### Cameron Berryman - kqe6rf

1. Acquire and read in the data source(s).
2. Describe how to get the data.
3. Describe who produced the data and how.
4. Describe the data's features with a COLS table.

### Setup

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

### Part 1: Acquire Data
The data that I used came from multiple sources. See below:

In [22]:
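# Load the scraped data, indexed and sorted by puzzle number.
WORDLE = pd.read_csv('wordledata.csv').set_index('puzzle_id').sort_index()
# Parse puzzle dates, which are stored as month/day/year strings.
WORDLE['puzzle_date'] = pd.to_datetime(WORDLE['puzzle_date'], format="%m/%d/%Y")
# Make sure first-use years are numeric (-1 marks a missing year).
WORDLE['word_first_use'] = pd.to_numeric(WORDLE['word_first_use'])
WORDLE.head()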

| puzzle_id | puzzle_date | puzzle_answer | word_first_use | word_etymology | word_year_accurate |
|---|---|---|---|---|---|
| 0 | 2021-06-19 | CIGAR | 1730 | Spanish cigarro | True |
| 1 | 2021-06-20 | REBUT | 1300 | Middle English, from Anglo-French reboter , fr... | False |
| 2 | 2021-06-21 | SISSY | 1879 | sis | True |
| 3 | 2021-06-22 | HUMPH | 1803 | imitative of a grunt | True |
| 4 | 2021-06-23 | AWAKE | 1100 | Middle English awaken (from Old English awacan... | False |


#### Data Sources
- The puzzle_id, puzzle_date, and puzzle_answer columns are all taken from the HTML of [WordFinder](https://wordfinder.yourdictionary.com/wordle/answers/), a website with several tools for word-based games like Scrabble, Words with Friends, and Wordle. The page I scraped is written by the team that made the website and lists all prior answers to the daily Wordle puzzle. For anyone unfamiliar, [Wordle](https://www.nytimes.com/games/wordle/index.html) is a game with a single 5-letter word to guess each day. The data was scraped on November 6th, 2025 and includes every answer up to and including that day.
  - The **puzzle_answer** column is the answer word for that day.
  - The **puzzle_date** column is the day the puzzle was active.
  - The **puzzle_id** column is the puzzle's ordinal number, starting from 0 (puzzle 0 is the first Wordle puzzle, whose answer was CIGAR).
- The word_first_use and word_etymology columns are derived from the HTML of the [Merriam-Webster Dictionary](https://www.merriam-webster.com/) page for the word in the **puzzle_answer** column of that row. I used the Python script [etymgroup2.py](etymgroup2.py) to scrape this data, working from the file [wordle_tables_clean.csv](wordle_tables_clean.csv). The Merriam-Webster online dictionary is written for the general public by lexicographers (professionals who research the meanings and origins of words in order to write dictionaries, a blend of linguist and etymologist).
  - The **word_first_use** column contains the year in which the word was first used, according to Merriam-Webster. If the first-use entry on the Merriam-Webster page lists a century instead of a specific year, the year is recorded as the start of that century (e.g., 13th century is recorded as 1200). If no first-use data could be found on the webpage, the year is recorded as -1.
    - The **word_year_accurate** column contains True only if the tool found a specific year. It contains False if the tool used a century or could not find first-use data on the webpage. This column is derived from the data on the webpage, but it was generated by my tool rather than taken directly from Merriam-Webster.
  - The **word_etymology** column contains the description of the word's etymology (origins), taken directly from Merriam-Webster and stripped of its HTML formatting. If no etymology is found, the tool uses the **word_first_use** column to try to infer the word's origin, assuming it is some variant of modern English. If the year is 1800 or later, the word is labeled "modern english". If the year is between 1450 and 1800, it is labeled "early modern english". Otherwise, if no inference can be made, it is labeled "unknown".
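
The labeling rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not the code in etymgroup2.py: the function names `parse_first_use` and `infer_etymology`, the regular expressions, and the example input strings are all assumptions made for this sketch.

```python
import re

# Hypothetical helpers illustrating the labeling rules described above.
# They are NOT taken from etymgroup2.py; names and patterns are assumptions.

MISSING_YEAR = -1  # sentinel recorded when no first-use data is found


def parse_first_use(text):
    """Return (year, year_is_accurate) for a first-use string."""
    # A specific four-digit year, e.g. "1730, in the meaning defined above".
    year_match = re.search(r"\b(\d{4})\b", text)
    if year_match:
        return int(year_match.group(1)), True
    # A century, e.g. "13th century" -> 1200 (the start of that century).
    century_match = re.search(r"\b(\d{1,2})(?:st|nd|rd|th) century", text)
    if century_match:
        return (int(century_match.group(1)) - 1) * 100, False
    # Nothing usable was found on the page.
    return MISSING_YEAR, False


def infer_etymology(year):
    """Fallback label, used only when no etymology text was found."""
    if year >= 1800:
        return "modern english"
    if 1450 <= year < 1800:
        return "early modern english"
    return "unknown"


print(parse_first_use("1730, in the meaning defined above"))  # (1730, True)
print(parse_first_use("13th century"))                        # (1200, False)
print(parse_first_use(""))                                    # (-1, False)
print(infer_etymology(1600))                                  # early modern english
```

Because a missing year is stored as -1, it falls through to "unknown" in the fallback, which matches the description above.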

In [37]:
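# Build the COLS table: one row per column of WORDLE (the puzzle_id index is not included).
cols_dict = {"ColumnName":list(WORDLE.columns),
             "DataType" : list(WORDLE.dtypes),
             # Measurement scales are assigned by hand, in the same order as WORDLE.columns.
             "DataScale" : ["Interval", "Nominal", "Interval", "Nominal", "Nominal/Boolean"],
             # Note: the minimum of word_first_use is the -1 sentinel for missing years.
             "Minimum" : list(WORDLE.min()),
             "Maximum" : list(WORDLE.max()),
             "UniqueCount" : list(WORDLE.nunique())}
COLS = pd.DataFrame(cols_dict)
# Save the COLS table alongside the data file.
COLS.to_csv("wordledata_cols.csv")
print(COLS)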

| | ColumnName | DataType | DataScale | Minimum | Maximum | UniqueCount |
|---|---|---|---|---|---|---|
| 0 | puzzle_date | datetime64[ns] | Interval | 2021-06-19 00:00:00 | 2025-11-06 00:00:00 | 1602 |
| 1 | puzzle_answer | object | Nominal | ABACK | ZESTY | 1602 |
| 2 | word_first_use | int64 | Interval | -1 | 1992 | 315 |
| 3 | word_etymology | object | Nominal | American Spanish | wide entry 1 | 1386 |
| 4 | word_year_accurate | bool | Nominal/Boolean | False | True | 2 |
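
The COLS table above allows a quick consistency check, sketched here under the assumption that Wordle has had exactly one puzzle per calendar day. If that holds, the number of days from the minimum to the maximum **puzzle_date** (inclusive) should equal the UniqueCount for that column. The variable names are my own, and the values are copied by hand from the table rather than read from the data file.

```python
from datetime import date

# Values copied from the COLS table above.
first_puzzle = date(2021, 6, 19)   # Minimum puzzle_date
last_puzzle = date(2025, 11, 6)    # Maximum puzzle_date
unique_dates = 1602                # UniqueCount for puzzle_date

# Inclusive count of calendar days in the range.
expected_days = (last_puzzle - first_puzzle).days + 1

print(expected_days)                  # 1602
print(expected_days == unique_dates)  # True
```

The counts agree, which is consistent with there being no gaps or duplicate dates in the scraped range, and with **puzzle_id** running from 0 to 1601.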
