
{::options parse_block_html="true" /}

We offer the following research projects for bachelor students:

Project Title: Comics Illustration Synthesizer using Generative Adversarial Networks

Responsible Professor: Lydia Y. Chen


Introduction

PhD Comics is a newspaper and webcomic strip written and drawn by Jorge Cham that follows the lives of several grad students. First published in 1997, when Cham was a grad student himself at Stanford University, the strip deals with issues of life in graduate school, including the difficulties of scientific research, the perils of procrastination, and the complex student-supervisor relationship.

Figure 1: An example from PhD Comics

Specific problem scenario: Drawing illustrations is a time-consuming and expensive process, so we ask whether machine learning algorithms can learn to produce illustrations from the dialogue in comics or from textual descriptions of the illustrations, using these text-image pairs to train generative adversarial networks (GANs). We will first build a text-image pair dataset by extracting dialogue and descriptions of illustrations from existing comics. Based on this dataset, we will adopt transfer learning and text-to-image generation approaches to build a text visualization model. Lastly, we will conduct both an automatic and a human evaluation. Objective: In this project, we aim to develop a text-to-image generative adversarial network for generating comic illustrations. To be concrete about the comic's style, we focus specifically on PhD Comics.

The whole project involves two main aspects: (1) data preparation and (2) model construction. In the data preparation step, to collect data automatically, students must align the text with the images and classify the characters and comic formats (e.g., 2, 3, or 4 grids). This work is not trivial and requires the help of machine learning. Furthermore, as we cannot simply feed raw text into a machine learning model, a proper text representation is needed; Bidirectional Encoder Representations from Transformers (BERT) is recommended for this. As our study case is comics, an interesting side branch of humour detection using BERT is also proposed. Then, in the model construction step, two types of model are proposed. First, we propose to construct a comics-to-comics conditional GAN, which explores using comic illustrations to generate comic illustrations. Second, we propose to implement a text-to-image GAN, which uses text to generate comic illustrations. Novelty: Previous work such as [6, 5] uses text to generate images, and the results on standard datasets are already promising (as shown in Fig. 2). Therefore, in our proposal, we want to build our own dataset. Since the characters in the comics are human, the model must capture the characters' emotions and body poses, which makes the task more difficult than generating still objects; this can also be seen in the 5th column of generated images in Fig. 2.

Figure 2: Examples of images generated by (a) AttnGAN, (b) MirrorGAN Baseline, and (c) MirrorGAN conditioned on text descriptions from CUB and COCO test sets and (d) the corresponding ground truth.

Testbed and baseline

The code for MirrorGAN and AttnGAN is provided together with the datasets they were applied to. Reproducing their results is therefore a good way to get familiar with the text-to-image GAN structure. Once we have collected our aligned PhD Comics text-image dataset, it should be easy to adapt its format to these algorithms.

1 Text-image dataset construction - extracting text from image

  • Research Question: In this step, text-image pairs should be aligned. Students first need to decide which information the dataset will hold: either the dialogue (inside the illustrations) paired with the illustrations, or descriptions (of the illustrations) paired with the illustrations. Then a pipeline should be developed that automatically downloads all PhD Comics strips, segments the sub-figures of each strip, and extracts the text from each panel.

  • Method: Web crawling with Python 3 can be used to download the comics, and an Optical Character Recognition (OCR) tool (e.g., Google's Tesseract OCR engine) can extract the text from each comic.

  • Outcome: a data pipeline to automatically build the text-to-image dataset based on PhD Comics illustrations.

  • Related work: the paper [4], which automatically builds a large celebrity face dataset (FaceScrub), is a good reference for the process.

  • Timeline: 2 weeks to explore web crawling, 3 weeks to explore figure segmentation, 3 weeks to explore text extraction from images, and 2 weeks to analyze the results and write the report.
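As a sketch of this pipeline, the crawl and OCR steps could look like the following. The archive URL pattern and sequential strip IDs are assumptions to be checked against the live site, and `pytesseract`/`Pillow` are optional extras that wrap Google's Tesseract engine:

```python
# Assumed archive URL pattern for PhD Comics; verify against the live site.
BASE = "https://phdcomics.com/comics/archive.php?comicid={}"

def comic_urls(first_id, last_id):
    """Build the archive-page URLs to crawl (strip IDs are assumed sequential)."""
    return [BASE.format(i) for i in range(first_id, last_id + 1)]

def download(url, path):
    """Fetch one page or image; plain urllib keeps the crawler dependency-free."""
    import urllib.request
    urllib.request.urlretrieve(url, path)

def extract_text(image_path):
    """OCR one panel; pytesseract and Pillow are optional extra dependencies."""
    import pytesseract              # pip install pytesseract (plus the Tesseract binary)
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path))
```

Panel segmentation (splitting a strip into its 2, 3, or 4 grids) would sit between `download` and `extract_text`, and is the part that may need a learned model rather than fixed rules.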

2 Text-image dataset construction - auto labeling

  • Research Question: In PhD Comics, several recurring characters (e.g., Cecilia, Mike; details here) appear in every panel. Students need to annotate them in each panel. The encoded characters can then be used as a conditional vector for the generative adversarial network.
  • Method: It is recommended to use the OpenCV library in Python to extract all faces from the illustrations. After that, first manually clean up a small part of the data, then train a convolutional neural network (CNN) to classify the remaining images.
  • Outcome: a data pipeline to automatically label the images in PhD Comics.
  • Related work: the paper [4] that built the FaceScrub dataset includes a face-extraction step.
  • Timeline: 2 weeks to reproduce the results in [4], 4 weeks to build a small, clean annotated dataset and train a CNN on it, 2 weeks to auto-label the remaining data with the CNN and test it, and 2 weeks to analyze the results and write the report.
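The bootstrapping step described above can be sketched as follows. The split helper and the Haar-cascade settings are illustrative choices; note that OpenCV's cascades are trained on photographs, so they may need tuning (or replacement) for drawn faces:

```python
def split_for_labeling(paths, manual_fraction=0.1, seed=0):
    """Hold out a small subset for manual labels; the CNN later labels the rest."""
    import random
    rng = random.Random(seed)
    paths = list(paths)
    rng.shuffle(paths)
    cut = max(1, int(len(paths) * manual_fraction))
    return paths[:cut], paths[cut:]      # (manual set, auto-label set)

def detect_faces(image_path):
    """Crop candidate face regions with OpenCV's Haar cascade (optional cv2 dependency)."""
    import cv2                           # pip install opencv-python
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [img[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```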

3 Context representation and humour detection

  • Research Question: The goal is to learn good representations of the context that help image generation. Students will use a pre-trained BERT model, whose output serves as the encoded text vector. Furthermore, since Research Question 1 extracts the dialogue or descriptions from the comics, students can use BERT to encode this text and train a humour detector.
  • Method: Use a pre-trained BERT to encode the text extracted from the comics. Then use the structure of ColBERT [1] (Using BERT Sentence Embedding for Humor Detection) to classify PhD Comics text. The ColBERT code provides a dataset with 200k short formal texts (100k positive, 100k negative).
  • Outcome: a dialogue representation encoder using pre-trained BERT, and a text humour detector built with ColBERT.

Figure 3: An illustration of a generative adversarial network (GAN) learning to generate handwritten digits from the MNIST dataset. A GAN consists of one generator and one discriminator.
  • Related work: Bert-as-service provides an implementation that already integrates a pre-trained English BERT. The ColBERT implementation is provided in this [git](https://github.com/Moradnejad/ColBERT-Using-BERT-Sentence-Embedding-for-Humor-Detection), but we must add our own context to train a new humour detector.
  • Timeline: 2 weeks to integrate Bert-as-service into Python code and test it on the PhD Comics context, 2 weeks to annotate the PhD Comics context, 4 weeks to build our own ColBERT, and 2 weeks to analyze the results and write the report.
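A minimal sketch of the encoding step, using the Hugging Face `transformers` package as a stand-in for Bert-as-service (the batching helper is a hypothetical convenience, not part of either library):

```python
def batch(texts, size=32):
    """Split the extracted dialogue lines into fixed-size batches for the encoder."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]

def encode(texts):
    """Sentence embeddings from pre-trained BERT (optional transformers + torch deps)."""
    import torch
    from transformers import AutoModel, AutoTokenizer
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    with torch.no_grad():
        out = model(**tok(texts, padding=True, truncation=True, return_tensors="pt"))
    return out.last_hidden_state[:, 0]   # [CLS] vector, one per sentence
```

These per-sentence vectors are what a ColBERT-style classifier head would consume for humour detection.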

4 Conditional-GAN for comics-to-comics generation

  • Research Question: The goal is to implement a comics-to-comics conditional GAN (CGAN) [3], which uses comic illustrations to generate comic illustrations. The conditional vector can indicate the characters and the comic's format. Additionally, many GAN models suffer from problems such as (1) non-convergence, (2) mode collapse, and (3) diminishing gradients, so there are several methods worth exploring.
  • Method: The state-of-the-art method that combines the Wasserstein distance with a gradient penalty (WGAN-GP) [2] should be implemented.
  • Outcome: a comics-to-comics generative adversarial network.
  • Related work: a Conditional Wasserstein GAN with Gradient Penalty demo on the MNIST handwriting dataset is provided here. Students then need to apply these techniques to the PhD Comics dataset and fine-tune the parameters.
  • Timeline: 2 weeks to understand the structure of GANs, 2 weeks to run a GAN demo on a standard image dataset such as MNIST or CIFAR-10, 4 weeks to adapt the PhD Comics illustration format to the WGAN-GP structure, and 2 weeks to analyze the results and write the report.
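To make the gradient-penalty term concrete, here is a NumPy sketch of the WGAN-GP interpolation and penalty from [2], using a linear critic f(x) = w·x so that the gradient is known analytically; a real implementation computes the gradient with autograd:

```python
import numpy as np

def gradient_penalty(w, x_real, x_fake, rng=None):
    """WGAN-GP term E[(||grad f(x_hat)||_2 - 1)^2], illustrated with a linear
    critic f(x) = w.x whose gradient is w at every point (no autograd needed)."""
    rng = rng or np.random.default_rng(0)
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake   # random interpolates between pairs
    grads = np.tile(w, (x_hat.shape[0], 1))       # grad f(x_hat) = w for every sample
    norms = np.linalg.norm(grads, axis=1)
    return np.mean((norms - 1.0) ** 2)
```

The penalty is zero exactly when the critic is 1-Lipschitz along the interpolates, which is what WGAN-GP enforces softly instead of weight clipping.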

5 Text-to-image generative adversarial network

  • Research Question: The goal is first to reproduce the state-of-the-art models for text-to-image generation, and then to train them on PhD Comics illustrations to generate comic illustrations.
  • Method: State-of-the-art algorithms such as MirrorGAN [5] and AttnGAN [6] need to be reproduced. The first step is to use the standard datasets provided with the code to reproduce the results of the original papers; current results on these datasets are shown in Fig. 2. Then adapt the PhD Comics illustration data format to train the algorithms and obtain our own models.
  • Outcome: one or two text-to-image generative adversarial networks that can use comic dialogue or descriptions to generate comic illustrations.
  • Related work: Code for MirrorGAN and AttnGAN is provided on GitHub along with the datasets they use.
  • Timeline: 2 weeks to understand the structure of GANs, 4 weeks to reproduce the results of MirrorGAN and AttnGAN, 2 weeks to adapt the PhD Comics illustration format to these two algorithms, and 2 weeks to analyze the results and write the report.
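As a toy illustration of how text conditions the generator, the sketch below concatenates a sentence embedding with a noise vector. AttnGAN and MirrorGAN use richer conditioning (conditioning augmentation and word-level attention), so this is only the simplest variant:

```python
import numpy as np

def generator_input(text_emb, noise_dim=100, rng=None):
    """Simplest text conditioning: concatenate the sentence embedding with a
    Gaussian noise vector to form the generator's input."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(noise_dim)
    return np.concatenate([np.asarray(text_emb, dtype=float), z])
```

The noise part gives diversity across samples for the same caption, while the embedding part steers the content.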

Relation Between Research Questions

Successfully answering each of these research questions leads to the ultimate objective: efficiently collecting and building an annotated image dataset, and gaining deep insight into generative adversarial networks. Toward the end of the project, each research topic can benefit from the others' findings and move away from the baseline configuration. A thorough investigation of each topic can be a stand-alone workshop paper, and their combination can form a conference paper for a machine learning application track. This project will be carried out in collaboration with a PhD student and a master's student, who will provide the baseline system.

Project Title: Time Series Synthesis using Generative Adversarial Networks

Responsible Professor: Lydia Y. Chen


Introduction

'Data is the new oil' is a quote that goes back to 2006 and is credited to mathematician Clive Humby. It picked up more steam after The Economist published a 2017 report [13] titled 'The world's most valuable resource is no longer oil, but data'. Many companies nowadays discover valuable business insights from various internal and external data sources. However, the knowledge extracted from big data often impinges on personal privacy and leads to unjustified analysis [14]. To prevent the abuse of data and the risk of privacy breaches, the European Commission introduced the European General Data Protection Regulation (GDPR) and enforced strict data protection measures. This poses a new challenge to data-driven industries: finding scientific solutions that enable big discovery while respecting the constraints of data privacy and governmental regulation.

Figure 1: The synthetic data retains the structure of the original data but is not the same [11].

An emerging solution is to leverage synthetic data, which statistically resembles real data and can comply with the GDPR due to its synthetic nature. Industrial datasets (held by stakeholders such as banks, insurance companies, and health-care providers) present multi-fold challenges. The Generative Adversarial Network (GAN) [8] is one of the emerging data synthesis methodologies: a GAN is first trained on a real dataset and then used to generate data. Beyond its success in generating images, the GAN has recently been applied to generating time series [10, 12], which is also the target of this project.

This project is split into five parts and involves four topics: (1) time series generation using GANs, (2) differential privacy in GANs, (3) a distributed GAN framework, and (4) a federated learning GAN framework. For each part of the project, we select one scientific paper and its code; we expect the students to follow the paper and the code and reproduce the results.

Testbed and baseline

Time-GAN and DoppelGANger are the codebases for time series generation [10, 12]. GS-WGAN is the codebase for implementing differential privacy in GANs [7]. MD-GAN is the codebase for our selected distributed GAN framework [9]. FeGAN is the code for one type of federated learning GAN framework.

Prerequisite

This project focuses mainly on generative adversarial networks, and the available code is mostly implemented in Python. Students need experience with Python, especially with the PyTorch library. We will provide a crash-course training and materials on generative adversarial networks.

1 Time-series Generative Adversarial Networks (TimeGAN) [12]

  • Research Question: The objective is to use the TimeGAN [12] algorithm to produce realistic time-series data. Students must evaluate the TimeGAN algorithm and reproduce Table 2 of the paper, comparing only against alternative GAN approaches such as RCGAN, C-RNN-GAN, and WaveGAN.

  • Method: Students must go through the paper and understand the challenges of modeling time-series data with GANs, then reproduce Table 2 of the paper, comparing TimeGAN only with other GAN-based approaches such as RCGAN, C-RNN-GAN, and WaveGAN. If one lacks background knowledge about GANs, a basic study of GANs is fundamental.

  • Outcome: A data pipeline that can take real time-series data as input and produce the corresponding realistic generated time series.

  • Related work: The paper is linked here, and code is provided here.

  • Timeline: 2 weeks to learn the basics of GAN, 1 week to learn time series, 5 weeks to study the code and reproduce the results in the paper. 2 weeks to analyze the results and write a report.
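Before any GAN training, the raw series must be cut into fixed-length training sequences; a minimal NumPy sketch of that preprocessing step:

```python
import numpy as np

def sliding_windows(series, length):
    """Cut a 1-D series into overlapping fixed-length windows — the usual
    preprocessing step before training a time-series GAN such as TimeGAN."""
    n = len(series) - length + 1
    return np.stack([series[i:i + length] for i in range(n)])
```

TimeGAN itself then embeds each window and trains jointly with supervised and adversarial losses, but the windowing above is the common entry point for all the time-series baselines.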

2 Time-series Generative Adversarial Networks (DoppelGANger) [10]

  • Research Question: The objective is similar to Research Question 1, but here we use a different algorithm, DoppelGANger [10], to produce realistic time-series data. This algorithm can also exploit the metadata commonly found alongside time-series data to boost the quality of the generated series. Students must compare the performance of DoppelGANger with other machine learning models such as autoregressive models, Markov models, and recurrent neural networks (RNNs). Hence, Table 3 of the paper [10] must be reproduced, considering only the above-mentioned algorithms.

  • Method: Students are encouraged to read the paper to grasp the state-of-the-art techniques for generating realistic time-series data conditioned on static metadata. As before, a student without background knowledge of GANs should first complete a basic study of the topic.

  • Outcome: A data pipeline that takes real time-series data as input and produces the corresponding realistic generated time series, along with a reproduction of Table 3 in the paper [10].

  • Related work: The paper can be found here, and code is linked as well.

  • Timeline: 2 weeks to learn the basics of GAN, 1 week to learn time series, 5 weeks to study the code and reproduce the results in the paper. 2 weeks to analyze the results and write a report.

3 A Gradient-Sanitized Approach for Learning Differentially Private Generators (GS-WGAN) [7]

  • Research Question: The goal is to train a differentially private GAN, which allows releasing a sanitized form of sensitive data with rigorous privacy guarantees. Students must attempt to reproduce Table 1 of the paper, comparing only against the DP-SGD GAN model. Lastly, the developed algorithm must also be applied to time-series data to see whether it works well there.

  • Method: Use the GS-WGAN framework provided in [7].

  • Outcome: Understand the paper [7] and reproduce the results of Table 1, comparing against the DP-SGD GAN model only.

  • Related work: The paper is here, and the codebase is provided in this git.

  • Timeline: 1 week to learn the concept of differential privacy, 2 weeks to learn about the GAN framework, 4 weeks to reproduce the results, 1 week to apply on time series data, 2 weeks to analyze the results and write a report.
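The core sanitization idea (shared by DP-SGD and GS-WGAN) is to bound each gradient's norm and add calibrated Gaussian noise before it leaves the sensitive side. A NumPy sketch, with `clip` and `sigma` as illustrative parameters:

```python
import numpy as np

def sanitize(grad, clip=1.0, sigma=1.0, rng=None):
    """Gradient sanitization in the DP-SGD/GS-WGAN spirit: bound the gradient's
    L2 norm by `clip`, then add Gaussian noise scaled to that bound."""
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip / max(norm, 1e-12))
    return clipped + rng.normal(0.0, sigma * clip, size=grad.shape)
```

GS-WGAN's contribution is *where* this is applied — only to the gradient flowing from discriminator to generator — so the released generator never sees unsanitized information about the real data.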

4 Distributed Generative Adversarial Networks for Distributed Datasets (MD-GAN) [9]

  • Research Question: A specific feature of GANs is that training the generator does not require access to the privacy-sensitive real data. A distributed GAN can therefore provide a higher level of privacy preservation: the discriminator and generator are trained on separate machines, and only the discriminator is given access to the real data, so the sensitive data never leaves the owner's premises. For this part, we challenge students to build such a distributed GAN framework, in which the generator and discriminator are trained on separate machines and communicate over a shared network. Finally, the implementation must be used to generate time-series data.
  • Method: The current proposed solution is MD-GAN [9]; however, it uses multiple discriminators. Our requirement for this part is to use a single generator and a single discriminator, with the MD-GAN solution as a base reference.
  • Outcome: A distributed GAN framework
  • Related work: The MD-GAN paper and code are both provided. Interested students can also look at Generative Models for Effective ML on Private, Decentralized Datasets (https://arxiv.org/abs/1911.06679) and FeGAN: Scaling Distributed GANs (https://hal.archives-ouvertes.fr/hal-03118260/document) to expand their ideas.
  • Timeline: 2 weeks to understand the structure of GAN, 5 weeks to develop distributed GAN framework, 1 week to test on time series data, 2 weeks to analyze the results and write a report.
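The separation of concerns can be sketched in a single process: the discriminator host is the only object holding the real data, and the generator host only exchanges samples and scalar scores with it. The critic below is a toy stand-in for a trained network, kept simple so the data flow is visible:

```python
import numpy as np

class DiscriminatorHost:
    """Runs on the data owner's machine; only this side ever touches the real data."""
    def __init__(self, real_data):
        self._real = real_data                      # stays on-premise
    def score(self, fake_batch):
        # Toy stand-in for a trained critic: batches closer to the real mean score higher.
        return -float(np.linalg.norm(fake_batch.mean(axis=0) - self._real.mean(axis=0)))

class GeneratorHost:
    """Runs on another machine; it only ever sends samples and receives scores."""
    def __init__(self, dim, rng=None):
        self.shift = np.zeros(dim)
        self.rng = rng or np.random.default_rng(0)
    def sample(self, n):
        return self.rng.standard_normal((n, self.shift.size)) + self.shift

# The only traffic over the "network": fake samples out, a scalar score back.
disc = DiscriminatorHost(np.zeros((16, 2)))         # real data never leaves here
gen = GeneratorHost(dim=2)
feedback = disc.score(gen.sample(8))
```

In the actual assignment these two objects live in separate processes, and the score is replaced by the discriminator's gradient with respect to the fake batch.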

5 Federated Learning Generative Adversarial Networks (FeGAN)

  • Research Question: Federated learning targets the scenario where each client holds part of the training data and, together with the other clients, can build a stronger model. The sensitive real data never leaves the client's premises, and a server aggregates all clients' local models into one. For this part, we challenge students to build a federated learning GAN framework, in which both the generator and discriminator are trained on the clients and the server aggregates the clients' models into one GAN model. Finally, the implementation must be used to generate time-series data.
  • Method: The current proposed solution is FeGAN [15]. FeGAN has more functionality than the vanilla federated learning framework; our requirement is to train a GAN in federated learning style with one server and two clients, with all training data residing on the clients.
  • Outcome: A federated learning GAN framework
  • Related work: FeGAN paper and code are all provided.
  • Timeline: 2 weeks to understand the structure of GANs, 5 weeks to develop the federated learning GAN framework, 1 week to test on time-series data, and 2 weeks to analyze the results and write a report.
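The server-side aggregation at the heart of this setup is federated averaging. A NumPy sketch over per-client weight dictionaries (the dictionary format is an illustrative stand-in for a PyTorch state dict):

```python
import numpy as np

def fed_avg(client_weights):
    """Server-side FedAvg step: average each parameter tensor across clients.
    In a federated GAN this is applied to both the generator and the
    discriminator weights before broadcasting the result back to the clients."""
    keys = client_weights[0].keys()
    return {k: np.mean([w[k] for w in client_weights], axis=0) for k in keys}
```

FeGAN refines this plain average (e.g., with balanced sampling across clients), but unweighted averaging is the baseline the two-client requirement starts from.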

Relation Between Research Questions

Successfully answering each of these research questions leads to the ultimate objective: the first two parts focus on improving the utility for ML applications and the statistical similarity of synthetic time-series data, the third part focuses on enhancing the privacy guarantees of the synthetic data, and the last two parts focus on improving the training frameworks of GANs to provide stricter privacy by design.

References

[1] Issa Annamoradnejad. ColBERT: Using BERT Sentence Embedding for Humor Detection. arXiv e-prints, arXiv:2004.12765, April 2020.

[2] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, pages 5767–5777, 2017.

[3] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.

[4] Hongwei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP 2014), Paris, France, pages 343–347. IEEE, 2014.

[5] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. MirrorGAN: Learning text-to-image generation by redescription. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, pages 1505–1514. Computer Vision Foundation / IEEE, 2019.

[6] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, pages 1316–1324. IEEE Computer Society, 2018.

[7] Dingfan Chen, Tribhuvanesh Orekondy, and Mario Fritz. GS-WGAN: A gradient-sanitized approach for learning differentially private generators. 2020.

[8] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, 2014.

[9] Corentin Hardy, Erwan Le Merrer, and Bruno Sericola. MD-GAN: Multi-discriminator generative adversarial networks for distributed datasets.

[10] Zinan Lin, Alankar Jain, Chen Wang, Giulia Fanti, and Vyas Sekar. Using GANs for sharing networked time series data: Challenges, initial promise, and open questions. In Proceedings of the ACM Internet Measurement Conference (IMC '20). Association for Computing Machinery, New York, NY, USA, 2020.

[11] Karen Walker. Synthetic data: Unlocking the power of data and skills for machine learning. https://dataingovernment.blog.gov.uk/2020/08/20/synthetic-data-unlocking-the-power-of-data-and-skills-for-machine-learning/

[12] Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. Time-series generative adversarial networks. NeurIPS, 2019.

[13] The Economist. The world's most valuable resource is no longer oil, but data. https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data, 2017.

[14] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, pages 111–125, 2008.

[15] Rachid Guerraoui, Arsany Guirguis, Anne-Marie Kermarrec, and Erwan Le Merrer. FeGAN: Scaling distributed GANs. In Proceedings of the 21st International Middleware Conference. Association for Computing Machinery, New York, NY, USA, 2020.