# Where Should I Build My Comic Book Studio?

I've always wanted to have my own comic book line, but I cannot draw, so I would need to hire several artists to do that for me. That means building a studio, but where is the best place to do that?

In this project, I use the Pushshift Reddit API to search for comments about popular superheroes in 58 city-based subreddits, looking for the city with the most superhero discussion relative to its subreddit's member count.

In [1]:
# needed imports
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from sklearn.preprocessing import MinMaxScaler

## The Data

This project uses the Pushshift Reddit API and the FiveThirtyEight comic character dataset. I collected and cleaned all other data myself. I use Pushshift's API because its response metadata includes a total result counter, so each (subreddit, character) pair costs a single request instead of a paginated walk through every matching comment. This lets me do the search in roughly O(n^2) requests rather than the O(n^3) I would expect with the official Reddit API. I also created a list of city subreddits, but it is in a messy format and must be converted to CSV.
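To make the request-count argument concrete, here is a rough sketch. The function names, the page size of 100, and the figure of 250 matches per pair are all hypothetical values chosen for illustration; only the 58 subreddits and 10 characters come from this project.

```python
import math


def requests_with_counter(n_subreddits, n_characters):
    # One request per (subreddit, character) pair: the total is read from metadata.
    return n_subreddits * n_characters


def requests_with_paging(totals, page_size):
    # Without a counter, every pair needs enough pages to walk all of its matches
    # (at least one request even when there are no matches).
    return sum(max(1, math.ceil(t / page_size)) for t in totals)


n_subreddits, n_characters = 58, 10
print(requests_with_counter(n_subreddits, n_characters))  # 580

# Hypothetical: assume every pair has 250 matching comments and pages hold 100.
hypothetical_totals = [250] * (n_subreddits * n_characters)
print(requests_with_paging(hypothetical_totals, page_size=100))  # 1740
```

The gap grows with the number of matching comments per pair, which is why the counter matters most for heavily discussed characters.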

In [2]:
# Read the raw subreddit list; the context manager closes the file afterwards
with open('data.txt', 'r') as file1:
    Lines = file1.readlines()
Lines

['r/NYC - 196,227 members\n',
 '\n',
 'r/Seattle - 174,478 members\n',
 '\n',
 'r/LosAngeles - 165,432 members\n',
 '\n',
 'r/Chicago - 160,732 members\n',
 '\n',
 'r/Austin - 146,721 members\n',
 '\n',
 'r/Portland - 139,972 members\n',
 '\n',
 'r/SanFrancisco - 137,723 members\n',
 '\n',
 'r/Boston - 134,162 members\n',
 '\n',
 'r/Houston - 132,998 members\n',
 '\n',
 'r/Atlanta - 118,570 members\n',
 '\n',
 'r/Philadelphia - 114,199 members\n',
 '\n',
 'r/Denver - 113,338 members\n',
 '\n',
 'r/SeattleWa - 103,420 members\n',
 '\n',
 'r/Dallas - 93,039 members\n',
 '\n',
 'r/WashingtonDC - 87,856 members\n',
 '\n',
 'r/SanDiego - 87,145 members\n',
 '\n',
 'r/Pittsburgh - 67,386 members\n',
 '\n',
 'r/Phoenix - 65,370 members\n',
 '\n',
 'r/Minneapolis - 55,519 members\n',
 '\n',
 'r/Orlando - 52,588 members\n',
 '\n',
 'r/Nashville - 50,995 members\n',
 '\n',
 'r/StLouis - 46,800 members\n',
 '\n',
 'r/SaltLakeCity - 46,643 members\n',
 '\n',
 'r/Columbus - 46,600 members\n',
 '\n'

In [3]:
# Cleaning the data set to a usable form
# File used to get subreddit names and member numbers
# needs to be formatted.
names = []
members = []
for count, line in enumerate(Lines, start=1):
    if count % 2 != 0:  # original data.txt only has data on odd lines
        # Turn "r/NYC - 196,227 members" into "NYC,196227"
        row = line.strip().replace(",", "")
        row = row.replace(" - ", ",")
        row = row.replace(" members", "")
        row = row.replace("r/", "")
        name, member_count = row.split(",")
        names.append(name)
        members.append(int(member_count))
names, members

(['NYC',
  'Seattle',
  'LosAngeles',
  'Chicago',
  'Austin',
  'Portland',
  'SanFrancisco',
  'Boston',
  'Houston',
  'Atlanta',
  'Philadelphia',
  'Denver',
  'SeattleWa',
  'Dallas',
  'WashingtonDC',
  'SanDiego',
  'Pittsburgh',
  'Phoenix',
  'Minneapolis',
  'Orlando',
  'Nashville',
  'StLouis',
  'SaltLakeCity',
  'Columbus',
  'Raleigh',
  'NewOrleans',
  'Tampa',
  'KansasCity',
  'rva',
  'Charlotte',
  'Baltimore',
  'Detroit',
  'Vegas',
  'Indianapolis',
  'Cincinnati',
  'Miami',
  'Boulder',
  'Sacramento',
  'MadisonWi',
  'SanAntonio',
  'Cleveland',
  'Milwaukee',
  'Louisville',
  'Chattanooga',
  'LasVegas',
  'Buffalo',
  'Tucson',
  'Rochester',
  'FortWorth',
  'Albuquerque',
  'Charleston',
  'Tulsa',
  'Memphis',
  'Jacksonville',
  'Knoxville',
  'Albany',
  'bullcity',
  'DesMoines'],
 [196227,
  174478,
  165432,
  160732,
  146721,
  139972,
  137723,
  134162,
  132998,
  118570,
  114199,
  113338,
  103420,
  93039,
  87856,
  87145,
  67386,
  653
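The cleaning cell above depends on the data sitting on odd-numbered lines. As an alternative, here is a minimal sketch that matches each line against a pattern and skips anything that does not fit, so blank lines can appear anywhere. `LINE_PATTERN` and `parse_subreddit_lines` are hypothetical names, and the pattern assumes every data line looks like `r/Name - 123,456 members`.

```python
import re

# Assumed line shape: "r/<name> - <comma-grouped number> members"
LINE_PATTERN = re.compile(r"^r/(?P<name>\w+) - (?P<members>[\d,]+) members$")


def parse_subreddit_lines(lines):
    """Return (names, members) for every line that matches the expected shape."""
    names, members = [], []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # blank or malformed line
        names.append(match.group("name"))
        members.append(int(match.group("members").replace(",", "")))
    return names, members


sample = ['r/NYC - 196,227 members\n', '\n', 'r/Seattle - 174,478 members\n', '\n']
print(parse_subreddit_lines(sample))  # (['NYC', 'Seattle'], [196227, 174478])
```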

In [4]:
# Convert the data to a dictionary, then a DataFrame, then finally CSV
subreddit_dict = {"NAME": names, "MEMBERS": members}  # renamed to avoid shadowing the built-in dict
subreddit_data = pd.DataFrame(subreddit_dict)
subreddit_data.to_csv('subreddit_data.csv')
subreddit_data.head()
subreddit_data.dtypes

NAME       object
MEMBERS     int64
dtype: object

In [5]:
# Load the comic character data from FiveThirtyEight
dc_comics = pd.read_csv('dc-wikia-data_csv.csv')
marvel_comics = pd.read_csv('marvel-wikia-data_csv.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
comics = pd.concat([dc_comics, marvel_comics])
comics = comics.reset_index()
comics.columns = comics.columns.str.strip()

In [6]:
# Example Pushshift API request
r = requests.get('https://api.pushshift.io/reddit/search/comment/?q=science&subreddit=askscience&size=1&metadata=true')
data = r.json()
print(data)

{'data': [{'all_awardings': [], 'associated_award': None, 'author': 'Hapankaali', 'author_flair_background_color': None, 'author_flair_css_class': None, 'author_flair_richtext': [], 'author_flair_template_id': None, 'author_flair_text': None, 'author_flair_text_color': None, 'author_flair_type': 'text', 'author_fullname': 't2_ynsha', 'author_patreon_flair': False, 'author_premium': False, 'awarders': [], 'body': 'Let me give you a more general answer than most of the answers here.\n\nYes, there are chaotic systems in the human body. A better question might be: are there non-chaotic systems in the human body? The answer to that is probably "no" unless you define "system" very narrowly.\n\nRoughly speaking, when a system is not chaotic, we call it *integrable*. An integrable system is one where you can find the solution to the dynamics of the system exactly without error, sometimes with analytical means. A simple pendulum is an integrable system. You spend much of your high school scienc

In [7]:
# Get data from each subreddit for each keyword for most popular comic character
# by number of appearances
comics = comics.sort_values(by="appearances", ascending=False, ignore_index=True)
count_list = []
hero = 0
name = 0
# We'll use comics.name[:10] to get the 10 most popular characters, but this number can be changed
df = pd.DataFrame(columns=subreddit_data.NAME, index=comics.name[:10])

## Running the API

The next block is optional to run. It makes a lot of API requests, and I have already included the result. If you're fine with the top 10 most popular characters, then there is no point in running the block I commented out below.

In [8]:
# This nested loop handles the API calls. The inner while loop only advances `hero`
# after a successful (HTTP 200) response, so a failed request is retried, not skipped

# while name < len(subreddit_data.NAME):
#     while hero < len(comics.name[:10]):
#         print("HERO NAME: " + comics.name[hero])
#         print("CITYNAME: "+subreddit_data.NAME[name])
#         # the comics have this format of hero name followed by their ID, this removes ID
#         head, sep, tail = comics.name[hero].partition('(')
#         print(head)
#         r = requests.get(f'https://api.pushshift.io/reddit/search/comment/?q={head}&subreddit={subreddit_data.NAME[name]}&size=1&metadata=true')
#         if r.status_code == 200:
#             data = r.json()
#             count_list.append(data['metadata']['total_results'])
#             hero += 1
#             print("Loaded " + str(hero) + "/" + str(len(comics.name[:10])))
#             # print(count_list)
#     df[subreddit_data.NAME[name]] = count_list
#     name += 1
#     print("Name value "+str(name))
#     count_list.clear()
#     hero = 0
# df.to_csv("comic_mention.csv")
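The commented-out loop retries a failed request immediately and without limit, and it puts the character name into the URL unescaped. Below is a minimal sketch of one way to bound the retries and encode the query. `build_url`, `fetch_total`, `max_attempts`, and `delay` are hypothetical names, and `fetch_json` is an assumed callable that returns the parsed JSON dictionary on success or `None` on failure; the endpoint, the query parameters, and the `metadata.total_results` field are the ones used in this notebook.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_url(query, subreddit):
    # urlencode escapes spaces and punctuation in the character name.
    params = {"q": query, "subreddit": subreddit, "size": 1, "metadata": "true"}
    return BASE_URL + "?" + urlencode(params)


def fetch_total(fetch_json, query, subreddit, max_attempts=3, delay=1.0):
    """Return total_results for one pair, or None if every attempt fails."""
    url = build_url(query, subreddit)
    for attempt in range(max_attempts):
        payload = fetch_json(url)
        if payload is not None:
            return payload["metadata"]["total_results"]
        time.sleep(delay * (attempt + 1))  # wait a little longer after each failure
    return None


# Demo with a stand-in fetcher (no network): fails once, then succeeds.
responses = iter([None, {"metadata": {"total_results": 42}}])
print(fetch_total(lambda url: next(responses), "batman ", "NYC", delay=0))  # prints 42
```

With `requests`, `fetch_json` could be a small wrapper that calls `requests.get(url)` and returns `r.json()` when `r.status_code == 200` and `None` otherwise.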

## Results

We need a way to score the cities. A high number of occurrences is good, but it means less if the subreddit has a large member population. The ideal city has many occurrences relative to a small population. Thus we will divide each count by the subreddit's member count to weight the results.
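As a small worked example of this scoring idea, the sketch below uses plain dictionaries. The city names and numbers are made up for illustration, and `mentions_per_100_members` is a hypothetical helper. It sums the counts before dividing, which gives the same total as dividing each character's count and then summing.

```python
def mentions_per_100_members(mentions, members):
    """Score each city as total mentions per 100 subreddit members."""
    return {
        city: 100 * sum(counts) / members[city]
        for city, counts in mentions.items()
    }


# Made-up data: per-character mention counts and subreddit sizes.
mentions = {"BigCity": [300, 100], "SmallCity": [60, 40]}
members = {"BigCity": 200_000, "SmallCity": 20_000}

scores = mentions_per_100_members(mentions, members)
best_city = max(scores, key=scores.get)
print(scores)     # BigCity scores about 0.2, SmallCity 0.5
print(best_city)  # SmallCity
```

BigCity has four times as many raw mentions, but SmallCity wins because its subreddit is ten times smaller.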

In [9]:
# Take the total counts and add them together for each city
# Then divide by subreddit member numbers
total = ['total']
comic_mention = pd.read_csv("comic_mention.csv")
for city in subreddit_data.NAME:
    # Multiply by 100 so each score reads as mentions per 100 subreddit members
    comic_mention[city] = (comic_mention[city] / subreddit_data[subreddit_data.NAME == city].MEMBERS.values)*100
    total_mention = np.sum(comic_mention[city])
    total.append(total_mention)

In [10]:
# DataFrame.append was removed in pandas 2.0, so add the totals as a new row by label.
# `total` has one entry per column: the 'total' label followed by each city's sum.
comic_mention.loc[len(comic_mention)] = total
comic_mention

Unnamed: 0,name,NYC,Seattle,LosAngeles,Chicago,Austin,Portland,SanFrancisco,Boston,Houston,...,FortWorth,Albuquerque,Charleston,Tulsa,Memphis,Jacksonville,Knoxville,Albany,bullcity,DesMoines
0,spider-man (peter parker),0.196711,0.039547,0.110015,0.047906,0.064749,0.057869,0.015974,0.036523,0.091731,...,0.034947,0.029307,0.049395,0.076326,0.057715,0.083397,0.027994,0.071891,0.038358,0.02932
1,captain america (steven rogers),0.032615,0.031523,0.036269,0.018665,0.029989,0.074301,0.005809,0.023106,0.048121,...,0.011649,0.029307,0.006174,0.038163,0.051302,0.166795,0.020995,0.008986,0.0,0.048866
2,batman (bruce wayne),0.376605,0.282557,0.441873,0.620287,0.407576,0.505815,0.274464,0.247462,0.452638,...,0.227154,0.380986,0.117313,0.400712,0.301398,0.301514,0.160963,0.404385,0.143843,0.205238
3,"wolverine (james \""logan\"" howlett)",0.020385,0.026937,0.026597,0.044173,0.014313,0.057869,0.010891,0.023852,0.033835,...,0.011649,0.035168,0.012349,0.012721,0.032064,0.038491,0.020995,0.017973,0.028769,0.019547
4,"iron man (anthony \""tony\"" stark)",0.051471,0.037827,0.079187,0.042929,0.06543,0.054297,0.023235,0.031305,0.058647,...,0.023298,0.023445,0.024697,0.108129,0.032064,0.032076,0.027994,0.152768,0.00959,0.019547
5,superman (clark kent),0.09173,0.061326,0.166836,0.151805,0.109732,0.117166,0.037757,0.065592,0.109024,...,0.064069,0.064475,0.074092,0.101768,0.256509,0.16038,0.034992,0.062904,0.0,0.127052
6,thor (thor odinson),0.042808,0.04012,0.081,0.041062,0.092693,0.058583,0.014522,0.029069,0.068422,...,0.05242,0.046891,0.0,0.082687,0.051302,0.025661,0.034992,0.053918,0.028769,0.05864
7,benjamin grimm (earth-616),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,reed richards (earth-616),0.00051,0.0,0.0,0.001244,0.001363,0.000714,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,hulk (robert bruce banner),0.069307,0.051009,0.06891,0.062838,0.063386,0.102878,0.02977,0.046213,0.040602,...,0.017473,0.046891,0.018523,0.076326,0.051302,0.044906,0.027994,0.080877,0.047948,0.019547


In [11]:
# Extract number 1 city based on character mentions / member population
max_total = np.max(total[1:])
max_index = total.index(max_total)
best_city = comic_mention.columns[max_index]
print(best_city)

rva


## Discussion

The results say that the highest concentration of superhero discussion relative to subreddit member count is in r/rva, the subreddit for Richmond, Virginia. I'm familiar with that area and I agree. The results could be refined further, for example by comparing each city's actual population with its subreddit member count, or by computing the variance of the populations to account for the population density of each city.