# HW 7
This homework starts with ingesting a dataset, followed by some cleaning and a few visualizations, and then moves on to Streamlit.

## US Census Data

Load the data posted on the GitHub repo: "us-population-2010-2019.csv".

In [1]:
import pandas as pd

file_path = "us-population-2010-2019-states-code.csv"
df_population = pd.read_csv(file_path)

In [2]:
df_population.head()

Unnamed: 0,states,states_code,id,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Alabama,AL,1,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
1,Alaska,AK,2,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
2,Arizona,AZ,4,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
3,Arkansas,AR,5,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
4,California,CA,6,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223


Using a lambda function, add a column to the dataframe that gives the standard two-letter abbreviation for each US state. For example, Connecticut would be CT.
https://www.50states.com/abbreviations.htm
*Hint:* there is a Python package called "us".

In [3]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.50states.com/abbreviations.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the table that contains the abbreviations
table = soup.find('table') 

# Extract state names and their postal abbreviations into a dictionary
state_abbreviations = {}
for row in table.findAll('tr')[1:]: 
    columns = row.findAll('td')
    state = columns[0].text.strip()
    abbreviation = columns[1].text.strip()
    state_abbreviations[state] = abbreviation

df_population['state_abbreviation'] = df_population['states'].map(state_abbreviations)
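
The exercise asks for a lambda, while the cell above passes the scraped dictionary straight to `map`. As a minimal sketch of the lambda variant, the example below uses a small hand-written dictionary (`ABBREVIATIONS`, a hypothetical name covering only the five states in the preview) and a toy frame (`df_demo`, also hypothetical) instead of the scraped table:

```python
import pandas as pd

# Hypothetical lookup table with only the five states shown in the preview above;
# a full solution would cover every state (for example via the scraped dictionary).
ABBREVIATIONS = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
}

# Toy frame; "Atlantis" is included on purpose to show what happens to unknown names.
df_demo = pd.DataFrame(
    {"states": ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Atlantis"]}
)

# The lambda looks each name up and yields a missing value for unknown names.
df_demo["state_abbreviation"] = df_demo["states"].apply(
    lambda name: ABBREVIATIONS.get(name)
)
print(df_demo)
```

Using `.get` rather than `ABBREVIATIONS[name]` keeps the lambda from raising a `KeyError` on names that are not in the table, such as territories. The `us` package mentioned in the hint should be able to replace the dictionary, for example through `us.states.lookup(name)` and the `abbr` attribute of the result, though that variant is not tested here.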

In [4]:
df_population.head()

Unnamed: 0,states,states_code,id,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,state_abbreviation
0,Alabama,AL,1,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185,AL
1,Alaska,AK,2,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545,AK
2,Arizona,AZ,4,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717,AZ
3,Arkansas,AR,5,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804,AR
4,California,CA,6,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223,CA


In [5]:
# Since we already have state code in the dataset, we can delete state_abbreviation
df_population = df_population.drop('state_abbreviation', axis=1)

Reshape your data into a new df called df_reshaped, applying the following steps:

1. Convert 'year' column values to integers
2. Convert 'states' to strings
3. Get rid of the commas in the population numbers, and convert them to integers
4. Check your df_reshaped 

In [6]:
# Convert 'states' to strings
df_population['states'] = df_population['states'].astype(str)

# Remove commas in the population numbers and convert them to integers
for year in range(2010, 2020):  # The dataset covers the years 2010 through 2019
    df_population[str(year)] = (
        df_population[str(year)]
        .astype(str)  # Ensure the .str accessor works even if the column was parsed as numeric
        .str.replace(',', '', regex=False)  # Remove commas (literal match, not a regex)
        .astype(int)  # Convert to integer
    )

In [7]:
# Reshape the dataset
df_reshaped = pd.melt(df_population, id_vars=['states', 'states_code', 'id'], 
                       var_name='year', value_name='population')

# Convert 'year' column values to integers
df_reshaped['year'] = df_reshaped['year'].astype(int)

# Convert 'population' values to integers (if needed, seems already in int format)
df_reshaped['population'] = df_reshaped['population'].astype(int)
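
Step 4 asks for a check of the reshaped data. The sketch below runs the same clean-then-melt sequence on a two-state toy frame and then asserts the properties one would expect. The names `df_wide`, `df_long`, and `year_columns` are hypothetical, and the comma-formatted strings are an assumption made to exercise the cleaning step; the figures themselves come from the preview above.

```python
import pandas as pd

# Toy wide-format frame: one row per state, one column per year.
df_wide = pd.DataFrame({
    "states": ["Alabama", "Alaska"],
    "states_code": ["AL", "AK"],
    "id": [1, 2],
    "2010": ["4,785,437", "713,910"],
    "2011": ["4,799,069", "722,128"],
})

year_columns = ["2010", "2011"]
for col in year_columns:
    # regex=False treats the comma as a literal character
    df_wide[col] = df_wide[col].str.replace(",", "", regex=False).astype(int)

# Wide to long: every year column becomes a (year, population) pair of values.
df_long = df_wide.melt(
    id_vars=["states", "states_code", "id"],
    var_name="year",
    value_name="population",
)
df_long["year"] = df_long["year"].astype(int)

# Checks for step 4: one row per state-year pair, integer columns, no gaps.
assert len(df_long) == len(df_wide) * len(year_columns)
assert pd.api.types.is_integer_dtype(df_long["year"])
assert pd.api.types.is_integer_dtype(df_long["population"])
assert df_long["population"].notna().all()
```

The same assertions can be pointed at `df_reshaped`, with the row count compared against the number of states times the ten year columns.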

In [8]:
df_reshaped.head()

Unnamed: 0,states,states_code,id,year,population
0,Alabama,AL,1,2010,4785437
1,Alaska,AK,2,2010,713910
2,Arizona,AZ,4,2010,6407172
3,Arkansas,AR,5,2010,2921964
4,California,CA,6,2010,37319502


Save your df_reshaped to a csv file

In [9]:
df_reshaped.to_csv('reshaped_us-population-2010-2019.csv', index=False)

Subset your dataframe to selected_year = 2019 and store the result in a new dataframe called df_selected_year


In [10]:
selected_year = 2019
df_selected_year = df_reshaped[df_reshaped['year'] == selected_year]

Sort df_selected_year by population, from highest to lowest, and create a new df called "df_selected_year_sorted"

In [11]:
df_selected_year_sorted = df_selected_year.sort_values(by='population', ascending=False)

In [12]:
# Calculate the population difference between the selected year and the previous year
def calculate_population_difference(input_df, input_year):
    selected_year_data = input_df[input_df['year'] == input_year].reset_index()
    previous_year_data = input_df[input_df['year'] == input_year - 1].reset_index()
    selected_year_data['population_difference'] = selected_year_data.population.sub(
        previous_year_data.population, fill_value=0
    )
    selected_year_data['population_difference_absolute'] = abs(selected_year_data['population_difference'])
    # Keep only the columns of interest, largest increase first
    columns = ['states', 'id', 'population', 'population_difference', 'population_difference_absolute']
    return selected_year_data[columns].sort_values(by='population_difference', ascending=False)

df_population_difference_sorted = calculate_population_difference(df_reshaped, selected_year)
df_population_difference_sorted

Unnamed: 0,states,id,population,population_difference,population_difference_absolute
43,Texas,48,28995881,367215,367215
9,Florida,12,21477737,233420,233420
2,Arizona,4,7278717,120693,120693
33,North Carolina,37,10488084,106469,106469
10,Georgia,13,10617423,106292,106292
47,Washington,53,7614893,91024,91024
5,Colorado,8,5758736,67449,67449
40,South Carolina,45,5148714,64558,64558
42,Tennessee,47,6829174,57543,57543
28,Nevada,32,3080156,52815,52815
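
The function above subtracts two year slices position by position after `reset_index()`, which works as long as both slices list the states in the same order. As an alternative sketch (not the notebook's approach), the difference can be computed within each state using `groupby` and `diff`, which does not depend on row order. The frame `df_long` below is a hypothetical stand-in for `df_reshaped`, using the 2018 and 2019 figures for Alabama and Alaska from the preview of the data.

```python
import pandas as pd

# Small long-format frame in the shape of df_reshaped.
df_long = pd.DataFrame({
    "states": ["Alabama", "Alaska", "Alabama", "Alaska"],
    "year": [2018, 2018, 2019, 2019],
    "population": [4887681, 735139, 4903185, 731545],
})

# Sort so that each state's rows are in chronological order, then take the
# year-over-year difference within each state.
df_long = df_long.sort_values(["states", "year"])
df_long["population_difference"] = df_long.groupby("states")["population"].diff()

# Keep the selected year and put the largest increase first.
df_2019 = df_long[df_long["year"] == 2019].sort_values(
    "population_difference", ascending=False
)
print(df_2019)
```

One behavioral difference: the earliest year of each state has no predecessor, so `diff` leaves a missing value there, whereas `sub(..., fill_value=0)` substitutes zero for a missing operand.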


Filter states with population difference > 50000

In [13]:
df_greater_50000 = df_population_difference_sorted[df_population_difference_sorted['population_difference'] > 50000]

Calculate the % of States with population difference > 50000

In [14]:
total_states = len(df_selected_year)
states_population_difference_above_50000 = len(df_greater_50000)
percentage_population_difference_above_50000 = (states_population_difference_above_50000 / total_states) * 100

print(percentage_population_difference_above_50000)

23.076923076923077
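
The same percentage can be read off a boolean mask, because the mean of a boolean Series is the fraction of `True` values. A minimal sketch on four of the 2019 differences (Texas and Florida from the table above, Alabama and Alaska derived from the data preview); the names `differences` and `share_above` are hypothetical:

```python
import pandas as pd

# Year-over-year differences for Texas, Florida, Alabama, and Alaska.
differences = pd.Series([367215, 233420, 15504, -3594])

# Comparing yields a boolean Series; its mean is the fraction of True values.
share_above = (differences > 50000).mean() * 100
print(share_above)
```

For reference, the printed value of about 23.08 equals 3/13 expressed as a percentage, which is what 12 rows out of 52 would give. That reading is an inference from the number itself, not something the notebook states.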


## Plots

### Heatmap: run the following code to see a heatmap

In [15]:
!pip install altair
import altair as alt



In [16]:
alt.themes.enable("dark")

heatmap = alt.Chart(df_reshaped).mark_rect().encode(
        y=alt.Y('year:O', axis=alt.Axis(title="Year", titleFontSize=16, titlePadding=15, titleFontWeight=900, labelAngle=0)),
        x=alt.X('states:O', axis=alt.Axis(title="States", titleFontSize=16, titlePadding=15, titleFontWeight=900)),
        color=alt.Color('max(population):Q',
                         legend=alt.Legend(title=" "),
                         scale=alt.Scale(scheme="blueorange")),
        stroke=alt.value('black'),
        strokeWidth=alt.value(0.25),
        #tooltip=[
        #    alt.Tooltip('year:O', title='Year'),
        #    alt.Tooltip('population:Q', title='Population')
        #]
    ).properties(width=900
    #).configure_legend(orient='bottom', titleFontSize=16, labelFontSize=14, titlePadding=0
    #).configure_axisX(labelFontSize=14)
    ).configure_axis(
    labelFontSize=12,
    titleFontSize=12
    )

heatmap

### Choropleth: Run the following code to get a map of the population for the selected year above

In [17]:
# Choropleth via Altair
!pip install vega_datasets
import altair as alt
from vega_datasets import data



In [18]:
alt.themes.enable("dark")

states = alt.topo_feature(data.us_10m.url, 'states')

alt.Chart(states).mark_geoshape().encode(
    color=alt.Color('population:Q', scale=alt.Scale(scheme='blues')),   # scale=color_scale
    stroke=alt.value('#154360')
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(df_selected_year, 'id', list(df_selected_year.columns))
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

PyCharm:
Create a PDF reader Streamlit app in PyCharm using the following code. Upload some PDFs and run some queries.
Below, capture 5 queries and 5 responses. Check that the responses are accurate. See if you can get your app to fail, and analyze why.

In [None]:
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain



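def get_pdf_text(pdf_docs):
    """Concatenate the extracted text of every page in the uploaded PDFs."""
    text: str = ''
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # Fall back to an empty string in case a page yields no text (e.g. scanned images)
            text += page.extract_text() or ''
    return text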

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks= text_splitter.split_text(text)
    return chunks

def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore

def get_conversation_chain(vectorstore):
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain

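def handle_userinput(user_question):
    # The conversation chain only exists after 'Process' has been clicked
    if st.session_state.conversation is None:
        st.warning("Please upload and process your PDFs first.")
        return
    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']
    # Messages alternate: even positions are the user's, odd positions are the model's
    for i, message in enumerate(st.session_state.chat_history):
        if i % 2 == 0:
            st.write(f'Human Question: {message.content}')
        else:
            st.write(f'AI Response: {message.content}')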

        
def main():
    load_dotenv()
    st.set_page_config(page_title="Chat with PDFs", page_icon=":scalpel:")

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history= None

    st.header("Chat with PDFs :medical_symbol:")
    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_userinput(user_question)

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader("Upload your PDFs here and click on 'Process'", accept_multiple_files=True)
        if st.button('Process'):
            with st.spinner("Processing..."):

                # get pdf texts
                raw_text = get_pdf_text(pdf_docs)
                #st.write(raw_text)

                # chunk pdfs
                text_chunks = get_text_chunks(raw_text)
                #st.write(text_chunks)

                # create vector store with embeddings
                vectorstore = get_vectorstore(text_chunks)

                # create conversation chain
                st.session_state.conversation = get_conversation_chain(vectorstore)


if __name__=='__main__':
    main()
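
When analyzing why a query fails, it can help to picture what `chunk_size=1000` and `chunk_overlap=200` do to the text. The sketch below is a simplified, plain-Python illustration of overlapping windows. It is not LangChain's implementation: `CharacterTextSplitter` first splits on the separator and then merges pieces, whereas this hypothetical helper `chunk_with_overlap` ignores separators and cuts at fixed character offsets.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 2500-character sample made of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample)
print([len(chunk) for chunk in chunks])
```

With these settings each window repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Answers that need information spread across distant chunks are a plausible place to look for failures, since the retriever only passes a handful of chunks to the model.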


1. Explain the model.

The model discussed in the provided context is centered around the concept of rational addiction. It assumes that addicts are rational individuals who maximize their utility, even though they may be unhappy due to various circumstances. The model recognizes that addiction can stem from events that lower a person's utility, like anxiety-raising situations such as death or divorce.

The model further delves into the dynamics of addictive consumption by exploring steady-state consumption levels and the effects of changes in income and the cost of addictive goods. It also highlights that consumption of addictive goods responds less to temporary price changes compared to permanent ones. Additionally, the model suggests that present and future consumption of addictive goods are complementary, with a person becoming more addicted in the present when expecting events to raise future consumption.

The model also addresses the cessation of severe addictions through "cold turkey," indicating that strong addictions may only end abruptly. It implies that a rational person may choose to end their addiction if events lower their demand for the addictive good significantly or deplete their stock of consumption capital.

Overall, the model of rational addiction presented in the context seeks to explain addictive behavior by considering individuals as rational decision-makers who aim to maximize their utility, even in the case of addiction.

2. Name the authors.

The authors of the model of rational addiction presented in the context are Gary S. Becker and Kevin M. Murphy.

3. What are the application of the model?

The applications of the model of rational addiction can include understanding why individuals may struggle to end addictions like smoking, heroin, or alcohol, and why they may choose to stop consumption abruptly despite short-term discomfort. The model suggests that individuals with addictions make rational decisions based on long-term gains, even if it involves a significant short-term loss in utility. Additionally, the model can provide insights into addictive behavior and how individuals may weigh future preferences when dealing with strong addictions.

4. What kind of econometric tools that are being used by the author?

The author uses econometric tools such as first-order conditions for utility maximization and dynamic aspects of addictive consumption to develop the model of rational addiction. They also consider conditions that determine whether steady-state consumption levels are unstable or stable. Additionally, the author analyzes the variables that determine whether a person becomes addicted to a particular good and the effects of changes in income and the cost of addictive goods on long-run demand. Furthermore, they explore how consumption of addictive goods responds to changes in prices.

5. How can this model assist the policy-makers?

The model of rational addiction can assist policy-makers by providing insights into addictive behaviors. Understanding that addictions can involve forward-looking maximization with stable preferences can help policy-makers design more effective interventions and programs to address addiction issues. By recognizing that rational individuals weigh short-term losses against long-term gains, policymakers can tailor their strategies to help individuals make better choices regarding addictive substances or behaviors. Additionally, the framework can guide the development of policies that consider the impact of small changes in the environment on addiction initiation or termination. Ultimately, the model of rational addiction can inform evidence-based policy-making in the field of addiction prevention and treatment.