## Transformative Work Sites: Popularity Comparison Project
### Rough Draft - Version 8 (POST-CLASS)
Purpose: To compare the popularity of different fandoms on different sites, over a period of time  
Name(s): Alexis Vernon  
Date Created: 28 October 2019  
Last Updated: 6 January 2020

### Project Proposal
The proposal should address the following areas:  
     What problem are you solving?  
     What is the hypothesis or question addressed?  
     Why is the problem worth solving/answering?  
     What is the impact on society, industry, etc. if this problem is solved?  
     Where can you find data?  
     What needs to be done with the data to address the problem?  
     What models can be developed to answer the question, or provide predictions?  
     How can you demonstrate the project success?    
     What resources are required beyond a computer and the internet?  
     
I am unsure about my proposed project, but this was the one that I was most interested in. Although it might not help me find a job in the future, it is something I am excited to research further. I am interested in which fandoms are most popular and most active on which sites that host fanfiction, that is, fictional text written by fans of any work of fiction.
  
1- FanFiction Sites (Fandom Statistics)  
One of the ideas for a project I was interested in was examining fictional fan-written works, commonly known as fanfiction, and seeing which fandoms (subculture groups of fans around a particular work) are most popular and most active on which sites. The sites I would examine are popular ones such as Fanfiction.net, Archiveofourown.org, and Wattpad. There are a few others, but these are the main three that I want to focus on. I would get the data by web-scraping these sites, possibly over a (relatively short) period of time.
  
There are multiple sites that host fanfiction but some are more popular with certain fandoms than others, so I want to identify them.  
  
Answering this question would probably not have much of an impact on society or industry in the traditional sense, for a variety of reasons. Fanfiction has a stigma associated with it, and it is not widely understood or easily accepted. It has gained more understanding and acceptance in recent years, but not enough that I could say this project would benefit mainstream society. Fanfiction would not really benefit industry economically (outside of maybe focusing certain marketing on specific sites) because the works are free. However, this project could benefit those who are looking for works in their fandom to read and want to know where the ‘best’ or most popular place to read them would be. There is also not a lot of data about fanfiction trends in general, so I think this would be a good start to gathering it for possible research, so that hopefully fanfiction will be treated more seriously in the future.  

In order to get my answer, I would need to get the number of works in each fandom on each site and compare those numbers over a period of time. (EX: 400 K Naruto works on Fanfiction.net vs 300 K Naruto works on ArchiveOfOurOwn.org on Date 1)
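
A minimal sketch of that comparison, assuming each scrape is stored as one row per fandom, site, and date. The `snapshots` data frame, its column names, and the date are hypothetical; the counts reuse the placeholder figures from the example above and are not real measurements.

```r
# Hypothetical snapshot table: one row per fandom, site, and scrape date.
# Counts are placeholders taken from the example above, not scraped values.
snapshots <- data.frame(
    FANDOM = "Naruto",
    SITE   = c("FanFiction.net", "ArchiveOfOurOwn.org"),
    DATE   = as.Date("2019-10-28"),   # stands in for "Date 1"
    WORKS  = c(400000, 300000),
    stringsAsFactors = FALSE
)

# Split by site and give each count column a site-specific name.
ff  <- snapshots[snapshots$SITE == "FanFiction.net", c("FANDOM", "DATE", "WORKS")]
ao3 <- snapshots[snapshots$SITE == "ArchiveOfOurOwn.org", c("FANDOM", "DATE", "WORKS")]
names(ff)[3]  <- "FF.WORKS"
names(ao3)[3] <- "AO3.WORKS"

# Put both sites side by side for each fandom and date, then compare.
wide <- merge(ff, ao3, by = c("FANDOM", "DATE"))
wide$DIFFERENCE <- wide$FF.WORKS - wide$AO3.WORKS
wide
```

Repeating the scrape on later dates would add more rows per fandom, so the same table could track how the gap between sites changes.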
  
Because there are so many different fandoms, and so many works, across multiple sites, I would need to utilise data frames. One way to present this data and find out which sites are the most popular for which fandom could be a line graph displaying multiple lines. There are too many different fandoms to show all of them at once on one chart, so a chart comparing three different sites over a period of time would have to focus on a specific fandom. I could demonstrate my project's success if I can do this.
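
The line graph described above could look roughly like the following base-R sketch. Every number here is a made-up placeholder, and the dates, the `works` matrix, and the colours are assumptions for illustration only.

```r
# Hypothetical weekly counts for ONE fandom on three sites.
# All numbers are made-up placeholders for illustration only.
dates <- as.Date("2019-11-01") + 7 * (0:3)
works <- cbind(
    FanFiction.net      = c(400000, 400300, 400700, 401000),
    ArchiveOfOurOwn.org = c(300000, 300900, 301800, 302600),
    Wattpad             = c(150000, 150400, 150900, 151500)
)
site_cols <- c("blue", "red", "orange")

# One line per site: matplot() draws each column of `works` as its own line.
matplot(as.numeric(dates), works, type = "l", lty = 1, col = site_cols,
        xaxt = "n", xlab = "Scrape date", ylab = "Number of works",
        main = "Works over time for one fandom (placeholder data)")
axis(1, at = as.numeric(dates), labels = format(dates, "%b %d"))
legend("right", legend = colnames(works), lty = 1, col = site_cols)

# Growth over the whole period, per site.
growth <- works[nrow(works), ] - works[1, ]
growth
```

Keeping one column per site means another site can be added to the chart by adding one more column to `works`.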
  
I am not sure what possible resources that I would need beyond a computer and the internet, except maybe an open mind?

In [None]:
# INSTALL PACKAGES
install.packages("syuzhet")
install.packages("selectr")
install.packages("textclean")

In [None]:
# IMPORT LIBRARIES
library(xml2)
library(rvest)
library(syuzhet)
library(stringr)
library(textclean)

### Functions

In [None]:
## FUNCTIONS ##

# BOTH Websites
only_latin <- function(List){
    # gets rid of non ASCII characters (ex. Japanese characters)
    x <- List
    Encoding(x) <- "latin1"
    x <- iconv(x, "latin1", "ASCII", sub="")
    return(x)
}
only_latin1 <- function(List){
    # converts from UTF-8 to Latin-1, dropping characters that Latin-1 cannot represent
    x <- List
    Encoding(x) <- "latin1"
    x <- iconv(x, from="UTF-8", to="LATIN1", sub="")
    return(x)
}

num_fandoms <- function(df){
    length(df$FANDOM)
}


# ArchiveOfOurOwn.org FUNCTIONS

get_fandomsA <- function(html){
    # gets both the fandom name and the number of works
    html %>% html_nodes(".tags.index.group") %>% html_text() %>%
    unlist()
}


do_for_meA <- function(ao3urlA){
    # takes in an AO3 fandom-index URL and returns a data.frame

    # read_html
    ao3pageA <- read_html(ao3urlA)

    # raw text of each fandom group
    ao3_listA <- get_fandomsA(ao3pageA)

    # remove runs of double spaces; gsub() is vectorized, so no loop is needed.
    # The "\n" characters are kept because the splits below rely on them.
    ao3_listA2 <- gsub("  ", "", ao3_listA)

    # merge into one string
    ao3_allA <- paste(ao3_listA2, collapse=', ')

    # split string into individual fandoms
    x <- unlist(strsplit(ao3_allA, "\n\n"))

    # divide into separate vectors (fandom vector & work vector)
    fandom <- c()
    works <- c()
    for (i in seq_along(x)) {
        # split fandom and work
        me <- unlist(strsplit(x[i], "\n"))
        fandom <- c(fandom, me[2])
        works <- c(works, me[3])
    }
    # clean once, after the loop
    fandom <- only_latin(fandom)
    # number of works, converted to numeric type
    works <- as.numeric(str_remove_all(works, "[()]"))

    ao3A <- data.frame(FANDOM=fandom, AO3.WORKS = works, stringsAsFactors=FALSE)
    return(ao3A)
}

# MORE FUNCTIONS!!!!!!!!!!!!!!!!!!!!!!!

# Functions
get_all <- function(html){
    # gets both the fandom name and the number of works
    html %>% html_nodes(".tags.index.group") %>% html_text() %>%
    unlist()
}

clean_string <- function(ao3_list){
    # purpose: to remove extra white space and merge into one string
    # gsub() is vectorized, so no loop is needed; "\n" is kept because
    # clean_split(), only_fandoms() and only_works() split on it
    ao3_list2 <- gsub("  ", "", ao3_list)
    # merge into one string
    one_string <- paste(ao3_list2, collapse=', ')
    one_string
}

clean_split <- function(one_string){
    # Split list into individual fandoms
    x <- strsplit(one_string, "\n\n") #str_split_fixed(string, pattern, n)
    x <- unlist(x)
    # returns a vector of strings, each one a string of "\nfandom\n(#)"
    return(x)
}

only_fandoms <- function(groups){
    # Extract the fandom name from each "\nfandom\n(#)" group
    fandom <- c()
    for (i in seq_along(groups)) {
        # split fandom and work
        doe <- groups[i]
        ray <- strsplit(doe, "\n")
        me <- unlist(ray)
        fandom <- c(fandom, me[2])
    }
    # clean once, after the loop
    fandom <- only_latin(fandom)
    return(fandom)
}

only_works <- function(groups){
    # Extract the number of works from each "\nfandom\n(#)" group
    works = c()
    for (i in 1:(length(groups))) {
        doe <- groups[i]
        ray <- strsplit(doe, "\n" )
        me <- unlist(ray)
        works <- c(works, me[3])
    }
    works <- as.numeric(str_remove_all(works, "[()]")) # number of works, converted to numeric type
    return(works)
}

In [None]:
# FanFiction.net Functions

# named get_allF so it does not overwrite the AO3 get_all defined above
get_allF <- function(html){
      html %>% 
        html_nodes('#list_output') %>%      
        html_text() %>% 
        # Convert the list into a vector
        unlist()                             
    }

get_works <- function(html){
    # gets FanFiction.net work counts, e.g. "(1.2K)"
      html %>% 
        html_nodes('.gray') %>%       # The class name
        html_text() %>% 
        # Convert the list into a vector
        unlist()                             
    }

get_fandomsF <- function(html){
    # gets FanFiction.net Fandom names
      html %>% 
        html_nodes('#list_output') %>%  # big list
        html_nodes('a') %>%             # narrows down
        html_text() %>% 
        # Convert the list into a vector
        unlist()                             
    }

replace_K <- function(single){
    # converts a count such as "(1.2K)" or "(345)" to a number
    has_K <- grepl("K", single)
    # remove ( ) K
    single <- str_replace_all(single, "[K()]", "")
    # convert to numeric
    single <- as.numeric(single)
    # multiply by 1000 if the count was given in thousands
    if (has_K){
        single <- single * 1000
    }
    return(single)
}

do_for_meF <- function(url){
    # url and html info
    page <- read_html(url)
    
    # get list of fandoms + clean
    n <- get_fandomsF(page)
    n <- only_latin(n)
    
    # get list of corresponding works
    w <- get_works(page)

    # convert works to correct numerical value
    ffworksA <- c()
    for (i in seq_along(w)){
        new <- replace_K(w[i])
        ffworksA <- c(ffworksA, new)
    }
    ffA <- data.frame(FANDOM=n, FF.WORKS = ffworksA, stringsAsFactors=FALSE)
    return(ffA)
        
}
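
The count conversion done by `replace_K` can also be sketched in vectorized form, which would avoid the element-by-element loop in `do_for_meF`. This is a base-R sketch under the assumption that FanFiction.net counts look like `(1.2K)` or `(345)`; `parse_counts` is a hypothetical name and is not part of the original notebook.

```r
# Hypothetical helper: convert strings such as "(1.2K)" or "(345)" to numbers.
parse_counts <- function(counts){
    has_K  <- grepl("K", counts, fixed = TRUE)        # TRUE where the count is in thousands
    digits <- as.numeric(gsub("[K()]", "", counts))   # strip "K" and the parentheses
    ifelse(has_K, digits * 1000, digits)
}

# Works on a whole vector at once.
parse_counts(c("(1.2K)", "(345)", "(400K)"))
```

With a helper like this, the loop could be replaced by a single call on the whole vector returned by `get_works`.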


# Categories

## ANIME & MANGA

In [None]:
## AO3   WALKTHROUGH (without most functions)

# Functions
get_fandomsA <- function(html){
    # gets both the fandom name and the number of works
    html %>% html_nodes(".tags.index.group") %>% html_text() %>%
    unlist()
}

ao3urlA <- "https://archiveofourown.org/media/Anime%20*a*%20Manga/fandoms"
ao3pageA <- read_html(ao3urlA)

# create list
ao3_listA <- get_fandomsA(ao3pageA)

# remove runs of double spaces; gsub() is vectorized, so no loop is needed.
# The "\n" characters are kept because the splits below rely on them.
ao3_listA2 <- gsub("  ", "", ao3_listA)

# merge into one list 
ao3_allA <- paste(ao3_listA2, collapse=', ' )
#ao3_allA

# Split list into individual fandoms
x <- strsplit(ao3_allA, "\n\n") #str_split_fixed(string, pattern, n)
x <- unlist(x)
#x

# divide into separate vectors (fandom vector & work vector)
fandom <- c()
works <- c()
for (i in seq_along(x)) {
    # split fandom and work
    doe <- x[i]
    ray <- strsplit(doe, "\n")
    me <- unlist(ray)

    fandom <- c(fandom, me[2])
    works <- c(works, me[3])
}
# clean once, after the loop
fandomA <- only_latin(fandom)
# number of works, converted to numeric type
worksA <- as.numeric(str_remove_all(works, "[()]"))

ao3A <- data.frame(FANDOM=fandomA, AO3.WORKS = worksA, stringsAsFactors=FALSE)
ao3A

In [None]:
po = which(grepl("Full", ao3A$FANDOM))
fa <- ao3A[po, ]
fa

In [None]:
# This function does the same as the above steps
do_for_meA("https://archiveofourown.org/media/Anime%20*a*%20Manga/fandoms")

In [None]:
num_fandoms(ao3A)

In [None]:
# FF WALKTHROUGH

url <- ("https://www.fanfiction.net/anime/?")
page <- read_html(url)
n <- get_fandomsF(page)
n <- only_latin(n)
w <- get_works(page)
ffA <- data.frame(FANDOM=n, WORKS = w)
#ffA
# i loop
ffworksA <- c()
for (i in 1:(length(w))){
    new <- replace_K(w[i])
    ffworksA <- c(ffworksA, new)
}
#ffworksA
ffA <- data.frame(FANDOM=n, FF.WORKS = ffworksA, stringsAsFactors=FALSE)
ffA

In [None]:
# same as stuff above
ffA <- do_for_meF("https://www.fanfiction.net/anime/?")
ffA

In [None]:
totalA <- merge(ffA,ao3A,by="FANDOM", all.x=TRUE, all.y=TRUE)
totalA$CATEGORY <- "Anime/Manga"
totalA$FANDOM[2]

In [None]:
totalA <- totalA[-1,]

In [None]:
summary(totalA)

In [None]:
po = which(grepl("Yuri", totalA$FANDOM))
fa <- totalA[po, ]
fa

In [None]:
po = which(grepl("A", totalA$FANDOM))
fa <- totalA[po, ]
fa

## BOOKS & LITERATURE

In [None]:
# BOOKS & LIT
ffB <- do_for_meF("https://www.fanfiction.net/book/")
ao3B <- do_for_meA("https://archiveofourown.org/media/Books%20*a*%20Literature/fandoms")
totalB <- merge(ffB,ao3B,by="FANDOM", all.x=TRUE, all.y=TRUE)
totalB$CATEGORY <- "Books/Literature"
print(length(totalB$FANDOM))
totalB <- totalB[-1,]
totalB

In [None]:
summary(totalB)

In [None]:
B <- merge(ao3B,ffB,by="FANDOM", all.x=TRUE, all.y=TRUE)
B
length(B$FF.WORKS)

In [None]:
#?rbind

In [None]:
X <- merge(ao3B,ffB,by="FANDOM", all.x=TRUE, all.y=TRUE)
X

In [None]:
x <- cbind()
x

## CARTOONS/COMICS
FanFiction.net [COMICS] + [CARTOONS] = AO3 [CARTOONS/COMICS/GRAPHIC NOVELS]

In [None]:
# COMICS, CARTOONS, GRAPHIC NOVELS

ffCa <- do_for_meF("https://www.fanfiction.net/cartoon/")
ffCo <- do_for_meF("https://www.fanfiction.net/comic/")
ffC <- rbind(ffCa, ffCo)

ao3C <- do_for_meA("https://archiveofourown.org/media/Cartoons%20*a*%20Comics%20*a*%20Graphic%20Novels/fandoms")
totalC <- merge(ffC,ao3C,by="FANDOM", all.x=TRUE, all.y=TRUE)
totalC$CATEGORY <- "Cartoon/Comic/GraphicNovel"
print(length(totalC$FANDOM))
totalC <- totalC[-1,]
totalC

In [None]:
summary(totalC)

## GAMES

In [None]:
# VIDEO GAMES
ffG <- do_for_meF("https://www.fanfiction.net/game/")
ao3G <- do_for_meA("https://archiveofourown.org/media/Video%20Games/fandoms")
totalG <- merge(ffG,ao3G,by="FANDOM", all.x=TRUE, all.y=TRUE)
totalG$CATEGORY <- "Games"
print(length(totalG$FANDOM))
totalG <- totalG[-1,]
#totalG

In [None]:
summary(totalG)

## MOVIES

In [None]:
# MOVIES
ffM <- do_for_meF("https://www.fanfiction.net/movie/")
ao3M <- do_for_meA("https://archiveofourown.org/media/Movies/fandoms")
totalM <- merge(ffM,ao3M,by="FANDOM", all.x=TRUE, all.y=TRUE)
totalM$CATEGORY <- "Movies"
print(length(totalM$FANDOM))
totalM <- totalM[-1,]
#totalM

In [None]:
summary(totalM)

## TV SHOWS

In [None]:
# TV shows
ffT <- do_for_meF("https://www.fanfiction.net/tv/")
ao3T <- do_for_meA("https://archiveofourown.org/media/TV%20Shows/fandoms")
totalT <- merge(ffT,ao3T,by="FANDOM", all.x=TRUE, all.y=TRUE)
totalT$CATEGORY <- "TV Shows"
print(length(totalT$FANDOM))
totalT <- totalT[-1,]
#totalT

In [None]:
summary(totalT)

## PLAYS & THEATER

In [None]:
# PLAYS & THEATER
ffP <- do_for_meF("https://www.fanfiction.net/play/")
ao3P <- do_for_meA("https://archiveofourown.org/media/Theater/fandoms")
totalP <- merge(ffP,ao3P,by="FANDOM", all.x=TRUE, all.y=TRUE)
totalP$CATEGORY <- "Plays/Theater"
print(length(totalP$FANDOM))
totalP <- totalP[-1,]
#totalP

In [None]:
summary(totalP)

## TOTALS - MASTER LIST

In [None]:
master7 <- rbind(totalA, totalB, totalC, totalG, totalM, totalT, totalP) 

In [None]:
print(length(master7$FANDOM))

In [None]:
master7

In [None]:
summary(master7)

In [None]:
print("ArchiveOfOurOwn.org Total Works: ")
sum(na.omit(master7$AO3.WORKS))
print("FanFiction.net Total Works: ")
# NAs must be dropped here too: fandoms found only on AO3 have NA in FF.WORKS
sum(na.omit(master7$FF.WORKS))

# SEARCH METHODS

In [None]:
# KNOW EXACT FANDOM TITLE 

# search method
position = which(master7$FANDOM=="Naruto")
       # give number of works in that fandom  <-??
found <- master7[position, ]
found

In [None]:
# TO FIND ANY THAT CONTAIN PARTIAL MATCHES

#  ***SEARCH METHOD*** !!!!
position = which(grepl("yuri", master7$FANDOM))
# give number of works in that fandom
found <- master7[position, ]
found


In [None]:
# TO FIND ANY THAT CONTAIN PARTIAL MATCHES

##SEARCH METHOD!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
position = which(grepl("the end", master7$FANDOM))
# give number of works in that fandom
found <- master7[position, ]
found
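
The partial-match searches above are case-sensitive, so `"yuri"` would not match a fandom spelled `"Yuri"`. A minimal sketch of a case-insensitive variant follows; `find_fandom` and the `demo` data frame are hypothetical stand-ins, and the counts are placeholders rather than scraped values.

```r
# Hypothetical helper: return rows whose FANDOM contains `pattern`, ignoring case.
find_fandom <- function(df, pattern){
    df[grepl(pattern, df$FANDOM, ignore.case = TRUE), ]
}

# Stand-in data frame with the same columns as master7 (placeholder values).
demo <- data.frame(
    FANDOM    = c("Naruto", "Yuri!!! on Ice", "Boruto"),
    FF.WORKS  = c(400000, NA, 1000),
    AO3.WORKS = c(300000, 5000, NA),
    stringsAsFactors = FALSE
)

find_fandom(demo, "yuri")   # matches "Yuri!!! on Ice" despite the lowercase pattern
find_fandom(demo, "ruto")   # matches both "Naruto" and "Boruto"
```

Because `grepl()` treats the pattern as a regular expression, characters such as `(` or `.` in a fandom name would need escaping, or `fixed = TRUE` could be used instead when case does not matter.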

## ++++++++++++++++++SORTED+++++++++++++++++

# PLOTTING

In [None]:
# Number of fandoms per category (barplot)
count <- table(master7$CATEGORY)
barplot(count, main="Fandom Distribution by Category",
       xlab = "CATEGORIES", col=c("red", "blue", "yellow", "green", "lavender", "black", "purple"),
        ylab= "Number of Fandoms",
       legend = rownames(count), args.legend = list(x="bottomleft"))


In [None]:
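# Total works per category on each site (assumed intent; NAs are ignored)
count <- rbind(FF  = tapply(master7$FF.WORKS,  master7$CATEGORY, sum, na.rm=TRUE),
               AO3 = tapply(master7$AO3.WORKS, master7$CATEGORY, sum, na.rm=TRUE))
barplot(count, main="Work Distribution between Categories", beside=TRUE,
       xlab = "Categories", ylab = "Works", col=c("blue", "red"),
       legend = rownames(count))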

In [None]:
#table(master7$FF.WORKS, totalA$FF.WORKS)
#barplot(count, main="Work Distribution between Categories",
 #      xlab = "Works", col=c("blue", "red"))

### -------POST-CLASS ADDITIONS-------

In [None]:
# Set where to save a copy of the master7 data frame; the file is named by date so earlier copies are not overwritten
time <- paste("C:\\Users\\Alexis.000\\Desktop\\ENGR 122 - Morris\\FanFiction Project Folder\\Record\\", Sys.Date(), sep = "")
here <- paste(time, ".csv", sep = "")
here
# Create CSV file
write.csv(master7, here)
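
A more portable way to build the dated file name, and to read the saved copies back as one table, could look like the sketch below. It assumes the snapshots live in a folder called `Record`; `tempdir()` stands in for the real location, and `demo` is a placeholder for `master7`.

```r
# Assumption: snapshots live in a "Record" folder; tempdir() stands in for the real location.
record_dir <- file.path(tempdir(), "Record")
dir.create(record_dir, showWarnings = FALSE, recursive = TRUE)

# Stand-in for master7 (placeholder values).
demo <- data.frame(FANDOM = "Naruto", FF.WORKS = 400000, AO3.WORKS = 300000,
                   CATEGORY = "Anime/Manga", stringsAsFactors = FALSE)

# One file per day, named by date; row.names = FALSE avoids an extra index column.
out_file <- file.path(record_dir, paste0(Sys.Date(), ".csv"))
write.csv(demo, out_file, row.names = FALSE)

# Read every saved snapshot back and tag it with its date (taken from the file name).
files <- list.files(record_dir, pattern = "\\.csv$", full.names = TRUE)
history <- do.call(rbind, lapply(files, function(f){
    snap <- read.csv(f, stringsAsFactors = FALSE)
    snap$DATE <- as.Date(sub("\\.csv$", "", basename(f)))
    snap
}))
history
```

`file.path()` picks the right separator for the operating system, and the combined `history` table has one row per fandom per saved date, which is the shape needed for comparing counts over time.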

In [None]:
# TO FIND ANY THAT CONTAIN PARTIAL MATCHES
#  ***SEARCH METHOD*** !!!!
position = which(grepl("All Media Types", master7$FANDOM))
found <- master7[position, ]
found