# get_CottonQualityData_from_NASS

**Project:** Texas and Georgia Agriculture

**Date:** March 04, 2025

**Code Contact:** Kelechi Igwe, [igwekelechi.e@gmail.com]

**Inputs:** 
> Folder path to where all output CSV files should be stored.
        
**Outputs:** One CSV file per weekly report, saved in a folder per crop year, containing weekly cotton quality data for Texas and Georgia for every year available on the website.

**Description:** 

This script extracts weekly cotton quality data, by district, for Texas and Georgia for every available crop year. The files come from the National Agricultural Statistics Service (NASS) website: https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/


In [1]:
# Install the libraries below, if not already installed
# (os, re, io, json, and pickle ship with Python and need no installation)

#!pip install wget
#!pip install requests pandas chardet

In [2]:
# Import all required libraries

import os
import pandas as pd
#import wget # optional alternative for downloading files
import re # regular expressions, used to collect all links within the page text
import requests # access the website
from urllib.parse import urljoin, unquote # join URL parts and decode escapes such as %20
import chardet # detect the encoding of a text file
import json # temporarily save a file in this format before using it
import io # read a web object as a file, temporarily
import pickle # temporarily store the downloaded bytes on disk

In [52]:
# Insert the link to the website below

url = 'https://apps.ams.usda.gov/'
base_url = 'https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/'

## Section 1: Save URL links of weekly files for each year into a dictionary 

In [53]:
# Make a request to collect all website information from the provided link

r = requests.get(base_url)
if r.status_code != 200: # status code 200 means access is successful
    print("Website url is bad")
    exit()

# Collect links representing all the years available into a list
links = re.findall(r'href=["\'](.*?)["\']', r.text, re.IGNORECASE)

In [54]:
links

['/Cotton/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2015%20Crop/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2016%20Crop/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2017%20crop/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2018%20Crop/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2019%20Crop/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2020%20Crop/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2021%20Crop/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2022%20Crop/',
 '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2023%20Crop/']
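
The link-collection step above can be tried offline on a small HTML snippet. The following is a minimal sketch, assuming the directory page lists its folders as plain `href` attributes; `sample_html` and `extract_year_links` are hypothetical names used only for illustration and are not part of the original script.

```python
import re

# Hypothetical snippet standing in for the page text returned by requests.get(base_url)
sample_html = '''
<a href="/Cotton/">[To Parent Directory]</a>
<a href="/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2015%20Crop/">2015 Crop</a>
<a HREF='/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2017%20crop/'>2017 crop</a>
'''

def extract_year_links(html):
    """Return every href value except the parent-directory link."""
    # Same pattern as the notebook: case-insensitive, single or double quotes
    links = re.findall(r'href=["\'](.*?)["\']', html, re.IGNORECASE)
    return [link for link in links if link != "/Cotton/"]

year_links = extract_year_links(sample_html)
print(year_links)
```

Filtering out `/Cotton/` at this stage would leave only the year folders, which is the same check the next cell performs inside its loop.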

In [55]:
'''
THIS PART OF THE SCRIPT WILL EXTRACT THE LINKS TO ALL THE WEEKS AVAILABLE IN A YEAR, WHICH WILL BE USED TO MAKE A DOWNLOAD REQUEST
'''
# Input any years you already have data for, otherwise leave the list empty
years_available = [2014, 2013, 2025]

# store all weeks for each year here 
dictionary_of_years = {}

# Loop through all the listed links and extract the links to the specific year folders
for link in links:
    
    files_in_year = None # Initialize the value

    # Only select links that are actual cotton folders
    if link != "/Cotton/":
        
        # Get the current year
        year = int(link.split('/')[-2].split('%20')[0])

        # Check if current year is already in the list of downloaded years (might take out this part later)
        if year not in years_available:
            # Create a link for the current year which has not been downloaded
            files_in_year = urljoin(base_url, link.split('/')[-2])
            
            # Check if the created link is valid before continuing on
            if files_in_year: # Code below will only run if there are links in files_in_year
                
                # Get the folder name (this will be used to create a folder to store the weekly text files)
                folder_name = unquote(os.path.basename(files_in_year))
                
                #print('Saving: ', folder_name) # This is to keep track of which year is being processed.
            
                # Make a web request to get all the week files in the current year
                r2 = requests.get(files_in_year)
                if r2.status_code != 200: # status code 200 means access is successful
                    print("Website url is bad")
                    exit()
            
                # Collect all the links representing all weeks for the current year into a list
                weeks_links = re.findall(r'href=["\'](.*?)["\']', r2.text, re.IGNORECASE)
            
                # Loop through the list of weekly data and collect the links to each week in a list
                list_of_weeks = []
                for week in weeks_links:
                    if week != "/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/":
                        current_week = urljoin(url, week)
                        filename = unquote(os.path.basename(current_week))
            
                        # Attach the current week to the list of weeks
                        list_of_weeks.append(current_week)
            
                # Append the list of weeks in the current year to its corresponding year in the dictionary
                dictionary_of_years[folder_name] = list_of_weeks
            
            # print links for year in dictionary_of_years
            print(files_in_year)
        

https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2015%20Crop
https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2016%20Crop
https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2017%20crop
https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2018%20Crop
https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2019%20Crop
https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2020%20Crop
https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2021%20Crop
https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2022%20Crop
https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2023%20Crop
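
The year parsing and URL building done inside the loop can be isolated and checked on a single link. This is a sketch only: `parse_year_link` is a hypothetical helper, and it assumes every year folder is named with the crop year first (for example `2017 crop`), as in the listing above.

```python
from urllib.parse import urljoin, unquote

base_url = 'https://apps.ams.usda.gov/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/'

def parse_year_link(link, base=base_url):
    """Split a year-folder link into (year, folder name, absolute URL)."""
    folder = link.rstrip('/').split('/')[-1]   # last path segment, e.g. '2017%20crop'
    folder_name = unquote(folder)              # decoded name, e.g. '2017 crop'
    year = int(folder_name.split()[0])         # leading token is assumed to be the crop year
    return year, folder_name, urljoin(base, folder)

link = '/Cotton/Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/2017%20crop/'
year, folder_name, year_url = parse_year_link(link)
print(year, folder_name, year_url)
```

Because `base_url` ends with a slash, `urljoin` appends the folder segment rather than replacing the last path component.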


In [48]:
# After confirming that the links for all weeks in a year are in the dictionary, go to section 3
#dictionary_of_years

## Section 2: Download all the raw files for the links in the dictionary above (please see note)
**NOTE:** If you only want the Texas and Georgia extract and do not need the raw files, skip this section.

In [None]:
for year,files_list in dictionary_of_years.items():
    print('processing: ', year)
    
    # Create a folder for the current year if it doesn't exist already
    os.makedirs(year, exist_ok=True)

    for file in files_list:
        # Get file title for the current week
        filename = unquote(os.path.basename(file))
    
        # Uniquely name the textfiles before saving to folder
        text_file_path = os.path.join(year, filename)

        
        # Get the content of the single week's file
        r3 = requests.get(file)
        
        
        # Write the web results to text file
        if r3.status_code == 200:
            with open(text_file_path, "wb") as entry:
                entry.write(r3.content)
                
            print(f'{text_file_path} completed!')
            
        else:
            print('Failed to access link')
            exit()
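
The file-naming step in this loop can be checked without downloading anything. A minimal sketch, assuming weekly files are plain file names at the end of the URL path; `local_path_for` is a hypothetical helper, and the example URL and its file name `Week 01.txt` are made up for illustration.

```python
import os
from urllib.parse import unquote, urlsplit

def local_path_for(week_url, year_folder):
    """Build the local file path for one weekly file URL."""
    # Use only the path part of the URL so any query string is ignored
    filename = unquote(os.path.basename(urlsplit(week_url).path))
    return os.path.join(year_folder, filename)

# Hypothetical weekly file URL; the real links come from dictionary_of_years
example_url = ('https://apps.ams.usda.gov/Cotton/'
               'Weekly_Cotton_Quality_Data_files_by_NASS_Ag_Districts/'
               '2015%20Crop/Week%2001.txt')
print(local_path_for(example_url, '2015 Crop'))
```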

## Section 3: Run this part to extract data for only Texas and Georgia

In [56]:
'''
THIS PART OF THE SCRIPT WILL START DOWNLOADING FILES FOR EACH WEEK IN THE YEARS AVAILABLE IN 'dictionary_of_years'
'''

for year,files_list in dictionary_of_years.items():
    print('processing: ', year)
    
    # Create a folder for the current year if it doesn't exist already
    os.makedirs(year, exist_ok=True)

    for file in files_list:
        # Get file title for the current week
        filename = unquote(os.path.basename(file)).split('.')[0]
    
        # Uniquely name the textfiles before saving to folder
        text_file_path = os.path.join(year, filename)

        # Get the content of the single week's file
        r3 = requests.get(file)
        
        
        # temporarily write the requested data to pickle format
        with open('temp.pickle', 'wb') as f:
            pickle.dump(r3.content, f)
        
        # Read the pickle data
        with open('temp.pickle', 'rb') as f:
            data = pickle.load(f)
        
            # get the encoding of the file (useful to read as csv)
            preview = data[:100000] # preview first 100 KB of data and get the file encoding format
            file_encoding = chardet.detect(preview)['encoding']
        
            # The variable 'data' is of type 'bytes', but we need it as a file object to read it as csv
            file_object = io.BytesIO(data)
        
        # Now we can read it as csv
        df = pd.read_csv(file_object, delimiter = '\t', encoding = file_encoding, on_bad_lines = 'skip')
        #print(df.head())
        
        # To see all the states and districts contained in the file
        unique_dist = df['State-NASS District Number'].unique()
        
        # Keep only the rows whose 'State-NASS District Number' matches the districts of interest.
        # As written, the pattern keeps Georgia districts 70, 80, and 90 only; add the Texas
        # district codes to the pattern to include Texas rows.
        districts = 'GA-70|GA-80|GA-90'
        df_new = df[df['State-NASS District Number'].str.contains(districts, na=False)]
        
        # Save the filtered data as a CSV file in the folder for the current year
        df_new.to_csv(f'{text_file_path}.csv', sep=',', index=False)
    
print(f"Script run is successful! Here's how the {year} data looks: \n", df_new.head())

processing:  2015 Crop
processing:  2016 Crop
processing:  2017 crop
processing:  2018 Crop
processing:  2019 Crop
processing:  2020 Crop
processing:  2021 Crop
processing:  2022 Crop
processing:  2023 Crop
Script run is successful! Here's how the 2023 Crop data looks: 
 Empty DataFrame
Columns: [Week Beginning, Week Ending, State-NASS District Number, Official Color Grade, Leaf Grade, Extraneous Matter, Remarks, MIKE, Strength, HVI Color Code, HVI Color Quad, HVI Color RD, HVI Color +b, Trash % Surface, Length 100ths, Length Uniformity, Filtered out at 11:12:01 AM]
Index: []
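
The filtering step can be tested on a few in-memory rows before running it against the website. The sketch below uses the standard-library `csv` module rather than pandas so that it runs without third-party packages. The sample rows, the `TX-12` district code, and the `filter_states` helper are all hypothetical and only illustrate the idea of matching on a state prefix.

```python
import csv
import io

# Hypothetical tab-delimited bytes standing in for r3.content; real files have more columns
sample_bytes = (
    b"Week Beginning\tWeek Ending\tState-NASS District Number\tLeaf Grade\n"
    b"10/01/2023\t10/07/2023\tGA-70\t3\n"
    b"10/01/2023\t10/07/2023\tTX-12\t2\n"
    b"10/01/2023\t10/07/2023\tAL-40\t4\n"
)

def filter_states(raw_bytes, prefixes=('TX-', 'GA-'), encoding='utf-8'):
    """Return the rows whose district code starts with one of the given state prefixes."""
    # Decode the bytes and wrap them in a file-like object, with no temporary file on disk
    text_file = io.StringIO(raw_bytes.decode(encoding))
    reader = csv.DictReader(text_file, delimiter='\t')
    return [row for row in reader
            if (row.get('State-NASS District Number') or '').startswith(prefixes)]

rows = filter_states(sample_bytes)
print([row['State-NASS District Number'] for row in rows])
```

Matching on the state prefix keeps every district of a state; listing specific district codes, as the cell above does, is the narrower alternative.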


In [30]:
# Check to see where the files are saved
os.getcwd()

'C:\\Users\\kelechi\\OneDrive - Kansas State University\\Desktop\\Research Resources\\Conferences\\NASA_DEVELOP\\Scripts'
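
Because the year folders are created relative to the current working directory, the output location depends on where the notebook is run. The following is a minimal sketch, assuming a user-chosen output folder, of placing each year's folder under a single path. `csv_path_for`, `output_dir`, and the `week_01` file name are hypothetical and not part of the original script.

```python
import tempfile
from pathlib import Path

def csv_path_for(output_dir, year_folder, week_name):
    """Create <output_dir>/<year_folder>/ if needed and return the CSV path for one week."""
    year_dir = Path(output_dir) / year_folder
    year_dir.mkdir(parents=True, exist_ok=True)
    return year_dir / f'{week_name}.csv'

# A temporary directory stands in for the user's chosen output folder
with tempfile.TemporaryDirectory() as tmp:
    path = csv_path_for(tmp, '2023 Crop', 'week_01')
    created = path.parent.is_dir()
    relative = path.relative_to(tmp).as_posix()

print(created, relative)
```

In the loop of Section 3, the returned path could be passed to `df_new.to_csv(...)` in place of the f-string path, which would make the folder path listed under **Inputs** an explicit variable.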